
Feasible Reachable Policy Iteration

Shentao Qin*†,1, Yujie Yang*1, Yao Mu*2, Jie Li1,
Wenjun Zou1, Jingliang Duan3, Shengbo Eben Li✉1,

1School of Vehicle and Mobility, Tsinghua University
2Department of Computer Science, The University of Hong Kong
3School of Mechanical Engineering, University of Science and Technology Beijing

*Equal contribution, ✉Corresponding author
†Project Lead: qst23@mails.tsinghua.edu.cn

🔔News

🔥[2024-01-16]: Our paper Feasible Reachable Policy Iteration has been accepted by ICML 2024.

Abstract

FRPI is an efficient policy space pruning algorithm for safe reinforcement learning in goal-reaching problems with safety constraints, achieving the best performance in both safety and return.

Goal-reaching tasks with safety constraints are common control problems in the real world, such as intelligent driving and robot manipulation. The difficulty of this kind of problem comes from the exploration termination caused by safety constraints and the sparse rewards caused by goals. Existing safe RL methods avoid unsafe exploration by restricting the search space to a feasible region, the essence of which is the pruning of the search space. However, there is still much ineffective exploration within the feasible region because the goals are ignored. Our approach considers both safety and goals; the policy space pruning is achieved by a function called the feasible reachable (FR) function, which describes whether there is a policy that makes the agent safely reach the goal within a finite time horizon. This function naturally satisfies the self-consistent condition and the risky Bellman equation, and can be solved by fixed-point iteration. On this basis, we propose feasible reachable policy iteration (FRPI), which is divided into three steps: policy evaluation, region expansion, and policy improvement. In the region expansion step, using the information about whether the agent can reach the goals accelerates the convergence of the feasible region and simultaneously identifies a smaller feasible reachable region. The experimental results verify the effectiveness of the proposed FR function: it improves convergence speed to better or comparable performance without sacrificing safety, and it identifies a smaller policy space with higher sample efficiency.

Methods

Feasible Reachable Region Identification

Only a tiny subset of the state space is valuable to explore: the states from which the target can be reached in a finite horizon while the safety constraints are persistently satisfied.


Figure 1: The trajectories generated by a policy in the state space can be divided into four categories:
(1) trajectories that successfully reach the target set without violating constraints, which are feasible reachable;
(2) trajectories that can reach the target set but violate constraints, which are infeasible;
(3) trajectories that never violate the constraints but can never reach the target set, which are unreachable;
(4) trajectories that neither reach the target nor guarantee persistent constraint satisfaction, which are both infeasible and unreachable.


Figure 2: The intuitive relationship among the state space, constrained set, feasible region, reachable region, and feasible reachable region (FR region): \[\mathrm{X}_{\mathrm{FR}}^{*} \subseteq \mathrm{X}^{*}_{\mathrm{feas}} \cap \mathrm{X}^{*}_{\mathrm{reach}}.\]

Feasible Reachable Function

The feasible reachable (FR) function describes whether there is a policy that makes the agent safely reach the goal within a finite time horizon. It naturally satisfies the self-consistent condition and the risky Bellman equation, and can therefore be solved by fixed-point iteration.

\[F^\pi(x_0) = g(x_0) + c(x_0) + \sum_{m=1}^{T} \gamma^{m} \left[\prod_{n=0}^{m-1} (1+c(x_{n}))(1 - g(x_{n}))\right] \big(g(x_m)+c(x_m)\big),\] where \(g(x) = \mathbf{1}_{\mathrm{X}_{\mathrm{goal}}}(x)\) indicates whether the target set is reached, and \(c(x) = -\mathbf{1}_{\bar{\mathrm{X}}_{\mathrm{cstr}}}(x)\) indicates whether a state constraint is violated.
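To make the definition concrete, here is a minimal Python sketch that evaluates \(F^\pi(x_0)\) along a single trajectory generated by a policy. The function and toy-chain names (`fr_value`, `g_toy`, `c_toy`) are hypothetical, and the toy environment is purely illustrative; this is not the authors' implementation.

```python
def fr_value(trajectory, g, c, gamma=0.99):
    """Evaluate F^pi(x_0) along one trajectory x_0, ..., x_T generated by pi.

    g(x) -> 1.0 if x is in the target set, else 0.0
    c(x) -> -1.0 if x violates a state constraint, else 0.0

    F^pi(x_0) = g(x_0) + c(x_0)
              + sum_{m=1}^{T} gamma^m * [prod_{n=0}^{m-1} (1 + c(x_n)) * (1 - g(x_n))]
                            * (g(x_m) + c(x_m))

    The running product drops to zero as soon as the goal is reached (g = 1)
    or a constraint is violated (c = -1), so later terms stop contributing.
    """
    x0 = trajectory[0]
    value = g(x0) + c(x0)
    running_prod = 1.0
    for m in range(1, len(trajectory)):
        x_prev, x_m = trajectory[m - 1], trajectory[m]
        running_prod *= (1.0 + c(x_prev)) * (1.0 - g(x_prev))
        value += (gamma ** m) * running_prod * (g(x_m) + c(x_m))
    return value


# Illustrative usage on a 1-D toy chain: the goal is state 5, states below 0 are unsafe.
g_toy = lambda x: 1.0 if x == 5 else 0.0
c_toy = lambda x: -1.0 if x < 0 else 0.0
print(fr_value([0, 1, 2, 3, 4, 5], g_toy, c_toy))   # positive: safely reaches the goal
print(fr_value([0, 1, -1, 2, 3, 5], g_toy, c_toy))  # negative: violates the constraint first
```

A positive value indicates the trajectory reaches the goal while staying safe, and a negative value indicates a constraint violation occurs first, matching the role of the FR function in identifying the FR region.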


Figure 3: Forward Feasible Reachable Policy Tree by FR Function Identification. The green zone represents the feasible policy, while the red zone represents the FR policy. For a given state, actions are constrained to the feasible reachable action set.

Feasible Reachable Policy Iteration

FRPI alternates three steps: policy evaluation, region expansion, and policy improvement. Inside the feasible reachable region, the policy is improved by maximizing the return subject to the next state remaining feasible reachable; outside the region, the policy maximizes the FR function.

Inside the region:
\[ \pi_{k+1}(x) = \arg\max_{u}\; r(x, u) + \gamma V^{\pi_k}(x') \quad \text{s.t. } F^{\pi_k}(x') > 0 \]
Outside the region:
\[ \pi_{k+1}(x) = \arg\max_{u}\; F^{\pi_k}(x') \]
Feasible Reachable Bellman Equation:
\[ V(x) = \max_{u \in \mathrm{U}^*(x)} r(x, u) + \gamma V(x'), \qquad V: \mathrm{X}^{*} \rightarrow \mathbb{R} \]
Risky Bellman Equation:
\[ F(x) = c(x) + g(x) + (1 + c(x))(1-g(x))\,\gamma \max_{u} F(x') \]
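For intuition about how the three steps fit together, below is a tabular sketch of one FRPI iteration on a small finite, deterministic MDP. Everything here is an assumption made for illustration: the dict-based tables `V`, `F`, `pi`, the known dynamics `f(x, u)`, the tolerance, and the simplified inside/outside test via a nonempty FR action set. It is a sketch of the structure, not the authors' implementation.

```python
def frpi_step(V, F, pi, states, actions, f, r, g, c, gamma=0.99, tol=1e-6):
    """One illustrative FRPI iteration on a small finite, deterministic MDP.

    V, F, pi : dicts mapping states to value, FR value, and action (updated in place).
    states, actions : finite lists of hashable states / actions.
    f(x, u) -> x'   : known deterministic dynamics, assumed to map into `states`.
    r(x, u) : reward; g(x), c(x) : goal / constraint indicators as defined above.
    """
    # Step 1 -- policy evaluation: V^pi via Gauss-Seidel sweeps of the Bellman equation.
    while True:
        delta = 0.0
        for x in states:
            v_new = r(x, pi[x]) + gamma * V[f(x, pi[x])]
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < tol:
            break

    # Step 2 -- region expansion: fixed-point iteration on the risky Bellman equation
    # F(x) = c(x) + g(x) + (1 + c(x)) * (1 - g(x)) * gamma * max_u F(f(x, u)).
    while True:
        delta = 0.0
        for x in states:
            f_new = c(x) + g(x) + (1.0 + c(x)) * (1.0 - g(x)) * gamma * max(
                F[f(x, u)] for u in actions
            )
            delta = max(delta, abs(f_new - F[x]))
            F[x] = f_new
        if delta < tol:
            break

    # Step 3 -- policy improvement: inside the FR region, maximize return subject to
    # the successor staying feasible reachable; outside, maximize F to re-enter it.
    for x in states:
        fr_actions = [u for u in actions if F[f(x, u)] > 0]
        if fr_actions:   # simplified "inside the region" test
            pi[x] = max(fr_actions, key=lambda u: r(x, u) + gamma * V[f(x, u)])
        else:            # outside the region
            pi[x] = max(actions, key=lambda u: F[f(x, u)])
    return V, F, pi
```

Both inner loops are contractions for \(\gamma < 1\), so the fixed-point iterations converge; repeating `frpi_step` mirrors the alternation of policy evaluation, region expansion, and policy improvement described above.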

Experimental Results

1. Can the FR function enable efficient policy space pruning to achieve faster convergence than other algorithms?
2. Can FRPI-SAC speed up feasible region expansion and simultaneously identify a smaller FR region?
3. Does FRPI-SAC achieve comparable performance faster than other algorithms without sacrificing safety?

Main Results


Figure 4: The policy convergence of Q-learning (Penalty) and FRPI. FRPI prunes the policy space significantly and achieves better experimental results regardless of the environment dimension.

The Region-Expansion Experiment

Adaptive Cruise Control (ACC): The goal of ACC is to control a following vehicle so that it converges to a fixed distance from a leading vehicle.
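To illustrate how the goal and constraint indicators g and c from the method section could be specified for a task like ACC, the snippet below sketches them for a simplified two-dimensional state (gap to the leading vehicle and relative speed). The state layout and every threshold are assumptions made for illustration, not the settings used in the paper's experiments.

```python
# Hypothetical ACC state: x = (gap, relative_speed), where gap is the distance
# to the leading vehicle. All thresholds below are illustrative assumptions.
DESIRED_GAP = 30.0   # target following distance [m]
GAP_TOL     = 1.0    # tolerance around the desired gap [m]
SPEED_TOL   = 0.5    # tolerance on relative speed [m/s]
MIN_GAP     = 5.0    # minimum safe distance [m]

def g_acc(x):
    """Goal indicator: 1 when the follower holds the desired gap at matched speed."""
    gap, rel_speed = x
    reached = abs(gap - DESIRED_GAP) <= GAP_TOL and abs(rel_speed) <= SPEED_TOL
    return 1.0 if reached else 0.0

def c_acc(x):
    """Constraint indicator: -1 when the follower gets dangerously close."""
    gap, _ = x
    return -1.0 if gap < MIN_GAP else 0.0
```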


Figure 5: The region-expansion of ACC.

Quadrotor Trajectory Tracking: Different from the previous stabilization task, the quadrotor task is a trajectory tracking task from safe-control-gym, where a 2D quadrotor is required to follow a circular trajectory in the vertical plane while keeping its vertical position within a particular range.


Figure 6: The region-expansion of Quadrotor.

The Experiment on Safety Gym

FRPI-SAC achieves near-zero constraint violations on all tasks, with low and stable episode cost curves. In comparison, the cost curves of SAC and SAC-Lag either fail to converge to zero or fluctuate severely. Moreover, the proposed algorithm also delivers strong returns on all tasks in terms of convergence speed, training stability, and final performance. Although some unconstrained algorithms such as SAC achieve comparable returns, they do so at the cost of safety.


Figure 7: Optimization results on Safety Gym.

BibTeX


        @inproceedings{qinfeasible,
          title={Feasible Reachable Policy Iteration},
          author={Qin, Shentao and Yang, Yujie and Mu, Yao and Li, Jie and Zou, Wenjun and Li, Shengbo Eben and Duan, Jingliang},
          booktitle={Forty-first International Conference on Machine Learning},
          year={2024},
          url={https://openreview.net/pdf?id=ks8qSwkkuZ}
        }