On principled entropy exploration in policy optimization


Abstract

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while ensuring monotonic progress in a principled objective. ECPO conducts maximum-entropy exploration within a mirror-descent framework, but updates policies using a reversed KL projection. This formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to suboptimal policies, while retaining strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
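To make the abstract's description concrete, here is a minimal sketch of the kind of update ECPO performs, written in our own notation rather than the paper's: Q^{\pi_t} denotes the current action-value function, \mathcal{H} the entropy, \tau an assumed entropy temperature, and \eta an assumed mirror-descent proximity weight. The exact form used in the paper may differ.

% Entropy-regularized mirror-descent target (assumed form, not taken from the paper):
\pi_t^*(\cdot \mid s) = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi}\!\left[ Q^{\pi_t}(s, a) \right] + \tau \, \mathcal{H}\big(\pi(\cdot \mid s)\big) - \eta \, \mathrm{KL}\big(\pi(\cdot \mid s) \,\big\|\, \pi_t(\cdot \mid s)\big)

% Projection back onto the parametric policy family in the reversed KL direction:
\theta_{t+1} = \arg\min_{\theta} \; \mathbb{E}_{s}\Big[ \mathrm{KL}\big(\pi_t^*(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\big) \Big]

The direction of the projection is what makes this consistent with the abstract's claim: minimizing \mathrm{KL}(\pi_\theta \| \pi_t^*) over \theta tends to concentrate mass on a single mode of the target, whereas minimizing \mathrm{KL}(\pi_t^* \| \pi_\theta) is mass-covering, which matches the stated avoidance of mode-seeking behavior.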

Cite (APA)

Mei, J., Xiao, C., Huang, R., Schuurmans, D., & Müller, M. (2019). On principled entropy exploration in policy optimization. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2019-August, pp. 3130–3136). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2019/434
