Policy Optimization with Stochastic Mirror Descent

16 citations · 21 Mendeley readers

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes VRMPO, a sample-efficient policy gradient algorithm based on stochastic mirror descent. VRMPO introduces a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that VRMPO needs only O(ε⁻³) sample trajectories to reach an ε-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
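For context, a generic stochastic mirror descent policy update takes the following form; this is a standard sketch rather than the paper's exact VRMPO update, and the step size α and Bregman divergence D_ψ are illustrative assumptions:

\[
\theta_{k+1} = \arg\min_{\theta} \Big\{ -\langle \hat{g}_k, \theta \rangle + \tfrac{1}{\alpha}\, D_{\psi}(\theta, \theta_k) \Big\},
\]

where \( \hat{g}_k \) is a (here, variance-reduced) estimate of the policy gradient \( \nabla_\theta J(\theta_k) \). Choosing \( \psi(\theta) = \tfrac{1}{2}\|\theta\|_2^2 \) recovers ordinary stochastic gradient ascent as a special case.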

Cite

APA

Yang, L., Zhang, Y., Zheng, G., Zheng, Q., Li, P., Huang, J., & Pan, G. (2022). Policy Optimization with Stochastic Mirror Descent. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022 (Vol. 36, pp. 8823–8831). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v36i8.20863
