An off-policy natural policy gradient method for a partial observable Markov decision process

Abstract

The exploration-exploitation problem is a long-standing issue in reinforcement learning: an agent must decide whether to explore in search of a better action, which may not exist, or to exploit the current best action to accumulate reward. In this article, we propose an off-policy reinforcement learning method based on natural policy gradient learning as a solution to the exploration-exploitation problem. In our method, the policy gradient is estimated from a sequence of state-action pairs sampled by executing an arbitrary "behavior policy"; this allows us to address the exploration-exploitation problem by controlling how behavior policies are generated. By applying the method to the autonomous control of a three-dimensional cart-pole, we show that it realizes optimal control efficiently in a partially observable domain. © Springer-Verlag Berlin Heidelberg 2005.
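To illustrate the off-policy idea described in the abstract, the following is a minimal sketch of estimating an importance-weighted policy gradient and Fisher matrix from samples drawn under a separate behavior policy, followed by a natural gradient step. The toy linear-softmax target policy, the uniform behavior policy, and the placeholder random returns are all assumptions made for illustration; this is not the authors' exact estimator for the partially observable cart-pole task.

```python
import numpy as np

# Illustrative sketch only: a linear-softmax target policy pi_theta over a
# small discrete action set, with experience generated by a different
# (here uniform) behavior policy. Returns are random placeholders.

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))        # target-policy parameters

def softmax_probs(theta, phi):
    logits = theta @ phi
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, a):
    # d/d theta of log pi(a | s) for a linear softmax policy:
    # (1{a'=a} - pi(a'|s)) * phi for each action row a'.
    p = softmax_probs(theta, phi)
    g = -np.outer(p, phi)
    g[a] += phi
    return g

# A batch of (state features, action, return, behavior probability) tuples.
batch = []
for _ in range(200):
    phi = rng.normal(size=n_features)
    b = np.full(n_actions, 1.0 / n_actions)      # uniform behavior policy
    a = rng.choice(n_actions, p=b)
    R = rng.normal()                             # placeholder return
    batch.append((phi, a, R, b[a]))

# Importance-weighted estimates of the vanilla gradient and Fisher matrix.
g = np.zeros(theta.size)
F = np.zeros((theta.size, theta.size))
for phi, a, R, b_a in batch:
    w = softmax_probs(theta, phi)[a] / b_a       # importance weight pi / b
    psi = grad_log_pi(theta, phi, a).ravel()
    g += w * R * psi
    F += w * np.outer(psi, psi)                  # weighting F is a design choice
g /= len(batch)
F /= len(batch)

# Natural gradient step: precondition the gradient by the regularized Fisher.
step = np.linalg.solve(F + 1e-3 * np.eye(theta.size), g)
theta += 0.1 * step.reshape(theta.shape)
```

The per-sample weight pi(a|s) / b(a|s) is what lets the target policy's gradient be estimated from data generated by an arbitrary behavior policy, which is the mechanism the abstract refers to when it says the exploration-exploitation trade-off can be handled through how behavior policies are generated.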

Cite

APA

Nakamura, Y., Mori, T., & Ishii, S. (2005). An off-policy natural policy gradient method for a partial observable Markov decision process. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3697 LNCS, pp. 431–436). https://doi.org/10.1007/11550907_68
