
Feasible Reachable Policy Iteration

Shentao Qin*†,1, Yujie Yang*1, Yao Mu*2, Jie Li1,
Wenjun Zou1, Jingliang Duan3, Shengbo Eben Li✉1,

1School of Vehicle and Mobility, Tsinghua University
2Department of Computer Science, The University of Hong Kong
3School of Mechanical Engineering, University of Science and Technology Beijing

*Equal contribution, ✉Corresponding author
†Project Lead: qst23@mails.tsinghua.edu.cn

🔔News

🔥[2024-01-16]: Our paper Feasible Reachable Policy Iteration has been accepted by ICML 2024.

Abstract

FRPI is an efficient policy space pruning algorithm for safe reinforcement learning in goal-reaching problems with safety constraints, achieving the best performance in both safety and return.

Goal-reaching tasks with safety constraints are common control problems in the real world, such as intelligent driving and robot manipulation. The difficulty of this kind of problem comes from the exploration termination caused by safety constraints and the sparse rewards caused by goals. Existing safe RL methods avoid unsafe exploration by restricting the search space to a feasible region, the essence of which is the pruning of the search space. However, there is still much ineffective exploration within the feasible region because the goals are ignored. Our approach considers both safety and goals; the policy space pruning is achieved by a function called the feasible reachable (FR) function, which describes whether there is a policy that makes the agent safely reach the goal within a finite time horizon. This function naturally satisfies the self-consistent condition and the risky Bellman equation, and can be solved by fixed-point iteration. On this basis, we propose feasible reachable policy iteration (FRPI), which is divided into three steps: policy evaluation, region expansion, and policy improvement. In the region expansion step, using the information about whether the agent can reach the goals accelerates the convergence of the feasible region and simultaneously identifies a smaller feasible reachable region. The experimental results verify the effectiveness of the proposed FR function: it improves convergence speed to better or comparable performance without sacrificing safety, and it identifies a smaller policy space with higher sample efficiency.

Methods

Feasible Reachable Region Identification

Only a tiny subset of the state space is valuable to explore: the states from which the target can be reached in a finite horizon while the safety constraints are persistently satisfied.


Figure 1: The trajectories generated by a policy in the state space can be divided into four categories:
(1) trajectories that successfully reach the target set without violating constraints, which are feasible reachable;
(2) trajectories that can reach the target set but violate constraints, which are infeasible;
(3) trajectories that never violate the constraints but can never reach the target set, which are unreachable;
(4) trajectories that neither reach the target nor guarantee persistent constraint satisfaction, which are both infeasible and unreachable.


Figure 2: The intuitive relationship among the state space, constrained set, feasible region, reachable region, and feasible reachable region (FR region): \[\mathrm{X}_{\mathrm{FR}}^{*} \subseteq \mathrm{X}^{*}_{\mathrm{feas}} \cap \mathrm{X}^{*}_{\mathrm{reach}}.\]

Feasible Reachable Function

The feasible reachable (FR) function describes whether there is a policy that makes the agent safely reach the goal within a finite time horizon. It naturally satisfies the self-consistent condition and the risky Bellman equation, and can therefore be solved by fixed-point iteration.

\[F^\pi(x_0) = g(x_0) + c(x_0) + \sum_{m=1}^{T} \gamma^{m} \left[\prod_{n=0}^{m-1} (1+c(x_{n}))(1 - g(x_{n}))\right] \big(g(x_m)+c(x_m)\big),\] where \(g(x) = \mathbf{1}_{\mathrm{X}_{\mathrm{goal}}}(x)\) indicates whether the target set is reached, and \(c(x) = -\mathbf{1}_{\bar{\mathrm{X}}_{\mathrm{cstr}}}(x)\) indicates whether a state constraint is violated.
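To make the definition concrete, here is a minimal Python sketch that evaluates \(F^\pi(x_0)\) along a single trajectory generated by a policy. The function and toy-chain names (`fr_value`, `g_toy`, `c_toy`) are hypothetical, and the toy environment is purely illustrative; this is not the authors' implementation.

```python
def fr_value(trajectory, g, c, gamma=0.99):
    """Evaluate F^pi(x_0) along one trajectory x_0, ..., x_T generated by pi.

    g(x) -> 1.0 if x is in the target set, else 0.0
    c(x) -> -1.0 if x violates a state constraint, else 0.0

    F^pi(x_0) = g(x_0) + c(x_0)
              + sum_{m=1}^{T} gamma^m * [prod_{n=0}^{m-1} (1 + c(x_n)) * (1 - g(x_n))]
                            * (g(x_m) + c(x_m))

    The running product drops to zero as soon as the goal is reached (g = 1)
    or a constraint is violated (c = -1), so later terms stop contributing.
    """
    x0 = trajectory[0]
    value = g(x0) + c(x0)
    running_prod = 1.0
    for m in range(1, len(trajectory)):
        x_prev, x_m = trajectory[m - 1], trajectory[m]
        running_prod *= (1.0 + c(x_prev)) * (1.0 - g(x_prev))
        value += (gamma ** m) * running_prod * (g(x_m) + c(x_m))
    return value


# Illustrative usage on a 1-D toy chain: the goal is state 5, states below 0 are unsafe.
g_toy = lambda x: 1.0 if x == 5 else 0.0
c_toy = lambda x: -1.0 if x < 0 else 0.0
print(fr_value([0, 1, 2, 3, 4, 5], g_toy, c_toy))   # positive: safely reaches the goal
print(fr_value([0, 1, -1, 2, 3, 5], g_toy, c_toy))  # negative: violates the constraint first
```

A positive value indicates the trajectory reaches the goal while staying safe, and a negative value indicates a constraint violation occurs first, matching the role of the FR function in identifying the FR region.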


Figure 3: Forward Feasible Reachable Policy Tree by FR Function Identification. The green zone represents the feasible policy, while the red zone represents the FR policy. For a given state, actions are constrained to the feasible reachable action set.

Feasible Reachable Policy Iteration

FRPI alternates three steps: policy evaluation, region expansion, and policy improvement. Inside the feasible reachable region, the policy is improved by maximizing the return subject to the next state remaining feasible reachable; outside the region, the policy maximizes the FR function.

Inside the region:
\[ \pi_{k+1}(x) = \arg\max_{u}\; r(x, u) + \gamma V^{\pi_k}(x') \quad \text{s.t. } F^{\pi_k}(x') > 0 \]
Outside the region:
\[ \pi_{k+1}(x) = \arg\max_{u}\; F^{\pi_k}(x') \]
Feasible Reachable Bellman Equation:
\[ V(x) = \max_{u \in \mathrm{U}^*(x)} r(x, u) + \gamma V(x'), \qquad V: \mathrm{X}^{*} \rightarrow \mathbb{R} \]
Risky Bellman Equation:
\[ F(x) = c(x) + g(x) + (1 + c(x))(1-g(x))\,\gamma \max_{u} F(x') \]
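For intuition about how the three steps fit together, below is a tabular sketch of one FRPI iteration on a small finite, deterministic MDP. Everything here is an assumption made for illustration: the dict-based tables `V`, `F`, `pi`, the known dynamics `f(x, u)`, the tolerance, and the simplified inside/outside test via a nonempty FR action set. It is a sketch of the structure, not the authors' implementation.

```python
def frpi_step(V, F, pi, states, actions, f, r, g, c, gamma=0.99, tol=1e-6):
    """One illustrative FRPI iteration on a small finite, deterministic MDP.

    V, F, pi : dicts mapping states to value, FR value, and action (updated in place).
    states, actions : finite lists of hashable states / actions.
    f(x, u) -> x'   : known deterministic dynamics, assumed to map into `states`.
    r(x, u) : reward; g(x), c(x) : goal / constraint indicators as defined above.
    """
    # Step 1 -- policy evaluation: V^pi via Gauss-Seidel sweeps of the Bellman equation.
    while True:
        delta = 0.0
        for x in states:
            v_new = r(x, pi[x]) + gamma * V[f(x, pi[x])]
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < tol:
            break

    # Step 2 -- region expansion: fixed-point iteration on the risky Bellman equation
    # F(x) = c(x) + g(x) + (1 + c(x)) * (1 - g(x)) * gamma * max_u F(f(x, u)).
    while True:
        delta = 0.0
        for x in states:
            f_new = c(x) + g(x) + (1.0 + c(x)) * (1.0 - g(x)) * gamma * max(
                F[f(x, u)] for u in actions
            )
            delta = max(delta, abs(f_new - F[x]))
            F[x] = f_new
        if delta < tol:
            break

    # Step 3 -- policy improvement: inside the FR region, maximize return subject to
    # the successor staying feasible reachable; outside, maximize F to re-enter it.
    for x in states:
        fr_actions = [u for u in actions if F[f(x, u)] > 0]
        if fr_actions:   # simplified "inside the region" test
            pi[x] = max(fr_actions, key=lambda u: r(x, u) + gamma * V[f(x, u)])
        else:            # outside the region
            pi[x] = max(actions, key=lambda u: F[f(x, u)])
    return V, F, pi
```

Both inner loops are contractions for \(\gamma < 1\), so the fixed-point iterations converge; repeating `frpi_step` mirrors the alternation of policy evaluation, region expansion, and policy improvement described above.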

Experimental Results

1. Can the FR function enable efficient policy space pruning to achieve faster convergence than other algorithms?
2. Can FRPI-SAC speed up feasible region expansion and simultaneously identify a smaller FR region?
3. Does FRPI-SAC achieve comparable performance faster than other algorithms without sacrificing safety?

Main Results


Figure 4: The policy convergence of Q-learning (Penalty) and FRPI. FRPI prunes the policy space significantly and achieves better experimental results regardless of the environment dimension.

The Region-Expansion Experiment

Adaptive Cruise Control (ACC): The goal of ACC is to control a following vehicle so that it converges to a fixed distance from a leading vehicle.
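To illustrate how the goal and constraint indicators g and c from the method section could be specified for a task like ACC, the snippet below sketches them for a simplified two-dimensional state (gap to the leading vehicle and relative speed). The state layout and every threshold are assumptions made for illustration, not the settings used in the paper's experiments.

```python
# Hypothetical ACC state: x = (gap, relative_speed), where gap is the distance
# to the leading vehicle. All thresholds below are illustrative assumptions.
DESIRED_GAP = 30.0   # target following distance [m]
GAP_TOL     = 1.0    # tolerance around the desired gap [m]
SPEED_TOL   = 0.5    # tolerance on relative speed [m/s]
MIN_GAP     = 5.0    # minimum safe distance [m]

def g_acc(x):
    """Goal indicator: 1 when the follower holds the desired gap at matched speed."""
    gap, rel_speed = x
    reached = abs(gap - DESIRED_GAP) <= GAP_TOL and abs(rel_speed) <= SPEED_TOL
    return 1.0 if reached else 0.0

def c_acc(x):
    """Constraint indicator: -1 when the follower gets dangerously close."""
    gap, _ = x
    return -1.0 if gap < MIN_GAP else 0.0
```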


Figure 5: The region-expansion of ACC.

Quadrotor Trajectory Tracking: Different from the previous stabilization task, the quadrotor task is a trajectory tracking task from safe-control-gym, where a 2D quadrotor is required to follow a circular trajectory in the vertical plane while keeping its vertical position within a particular range.


Figure 6: The region-expansion of Quadrotor.

The Experiment on Safety Gym

FRPI-SAC achieves near-zero constraint violations on all tasks, with low and stable episode cost curves. In comparison, the cost curves of SAC and SAC-Lag either fail to converge to zero or fluctuate severely. Moreover, the proposed algorithm also delivers strong returns on all tasks in terms of convergence speed, training stability, and final performance. Although some unconstrained algorithms such as SAC achieve comparable returns, they do so at the cost of safety.


Figure 7: Optimization results on Safety Gym.

BibTeX


        @inproceedings{qinfeasible,
          title={Feasible Reachable Policy Iteration},
          author={Qin, Shentao and Yang, Yujie and Mu, Yao and Li, Jie and Zou, Wenjun and Li, Shengbo Eben and Duan, Jingliang},
          booktitle={Forty-first International Conference on Machine Learning},
          year={2024},
          url={https://openreview.net/pdf?id=ks8qSwkkuZ}
        }