Reinforcement learning faces the well-known exploration-exploitation problem: an agent must decide whether to explore in search of a better action, which may not exist, or to exploit the current best action to accumulate reward. In this article, we propose an off-policy reinforcement learning method based on natural policy gradient learning as a solution to the exploration-exploitation problem. In our method, the policy gradient is estimated from a sequence of state-action pairs sampled by executing an arbitrary "behavior policy"; this allows us to address the exploration-exploitation problem by controlling how behavior policies are generated. Applying the method to an autonomous control problem for a three-dimensional cart-pole, we show that it can realize optimal control efficiently in a partially observable domain. © Springer-Verlag Berlin Heidelberg 2005.
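The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea it describes: weight a policy-gradient estimate by the likelihood ratio between the target policy and an arbitrary behavior policy (the "off-policy" part), then precondition with an estimated Fisher information matrix (the "natural" part). The softmax policy over linear features, the function names, and the toy reward below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 2, 4

def softmax_policy(theta, s):
    """Target policy pi(a|s): softmax over linear scores theta[a] . s."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta; shape (n_actions, n_features)."""
    p = softmax_policy(theta, s)
    g = -np.outer(p, s)
    g[a] += s
    return g

def off_policy_natural_gradient(theta, episodes, eps=1e-3):
    """Crude importance-weighted policy gradient, preconditioned by an
    estimated (regularized) Fisher matrix to give a natural-gradient step.
    `episodes` holds trajectories of (state, action, reward, behavior_prob)
    tuples sampled from an arbitrary behavior policy."""
    grad = np.zeros_like(theta)
    fisher = eps * np.eye(theta.size)       # regularizer keeps F invertible
    n = 0
    for traj in episodes:
        ret = sum(r for (_, _, r, _) in traj)   # undiscounted return, for brevity
        rho = 1.0                                # trajectory importance weight
        for (s, a, _, b_prob) in traj:
            rho *= softmax_policy(theta, s)[a] / b_prob
        for (s, a, _, _) in traj:
            g = grad_log_pi(theta, s, a)
            grad += rho * ret * g                # REINFORCE-style term
            v = g.ravel()
            fisher += rho * np.outer(v, v)       # reweighted Fisher estimate
            n += 1
    grad /= max(len(episodes), 1)
    fisher /= max(n, 1)
    # Natural gradient direction: F^{-1} grad
    return np.linalg.solve(fisher, grad.ravel()).reshape(theta.shape)

# Toy usage: one episode drawn from a uniform-random behavior policy.
theta = np.zeros((n_actions, n_features))
episode = []
for _ in range(10):
    s = rng.normal(size=n_features)
    a = int(rng.integers(n_actions))        # behavior policy: uniform
    r = float(a == (s[0] > 0))              # arbitrary toy reward
    episode.append((s, a, r, 1.0 / n_actions))
theta += 0.1 * off_policy_natural_gradient(theta, [episode])
```

A single full-trajectory importance weight is used here for simplicity; it is known to have high variance, and handling the generation of behavior policies well, as the paper proposes, is one way to keep such weights well behaved.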
Citation
Nakamura, Y., Mori, T., & Ishii, S. (2005). An off-policy natural policy gradient method for a partial observable Markov decision process. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3697 LNCS, pp. 431–436). https://doi.org/10.1007/11550907_68