Hindsight Trust Region Policy Optimization

Abstract

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO retains convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
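The abstract does not spell out QKL's exact form; the following is a hedged sketch of the standard second-order identity that motivates a quadratic surrogate for the KL constraint. The notation is ours, not the paper's: p is the old policy, q the updated policy, a an action, and r the per-sample log-probability ratio.

% Sketch: why a quadratic in the log-ratio approximates KL(p || q)
% when q is close to p (notation is ours; the paper defines QKL precisely).
r = \log q(a \mid s) - \log p(a \mid s), \qquad
\mathrm{KL}(p \,\|\, q)
  = \mathbb{E}_{a \sim p}\!\left[-r\right]
  = \mathbb{E}_{a \sim p}\!\left[e^{r} - 1 - r\right]
  \approx \tfrac{1}{2}\,\mathbb{E}_{a \sim p}\!\left[r^{2}\right].

The middle equality uses \mathbb{E}_{a \sim p}[e^{r}] = \mathbb{E}_{a \sim p}[q/p] = 1, and the final step is the second-order Taylor expansion of e^{r} for small r. Since each sample of r^{2}/2 is nonnegative, a quadratic per-sample estimate avoids the sign cancellation of the naive estimator -r, which is consistent with the abstract's claim that QKL reduces variance in KL divergence estimation.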

Citation (APA)

Zhang, H., Bai, S., Lan, X., Hsu, D., & Zheng, N. (2021). Hindsight Trust Region Policy Optimization. In IJCAI International Joint Conference on Artificial Intelligence (pp. 3335–3341). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2021/459
