An online policy gradient algorithm for Markov decision processes with continuous states and actions

Abstract

We consider the learning problem under an online Markov decision process (MDP), which aims to learn the time-dependent decision-making policy of an agent so as to minimize the regret, that is, the performance difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) over T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
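For intuition, below is a minimal sketch of an online policy gradient update for a continuous-state, continuous-action MDP with a linear-Gaussian policy. It is not the paper's exact algorithm; the feature map, the placeholder time-varying reward, the transition stand-in, and all names (feature_dim, step_size, etc.) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

feature_dim, action_dim = 4, 1               # assumed dimensions
theta = np.zeros((feature_dim, action_dim))  # policy parameter (mean map)
sigma = 0.5                                  # fixed exploration noise (assumption)

def features(state):
    # Hypothetical state features; here simply the raw state vector.
    return state

def sample_action(state, theta):
    # Gaussian policy: a ~ N(features(s) @ theta, sigma^2 I).
    mean = features(state) @ theta
    return mean + sigma * rng.standard_normal(action_dim)

def log_policy_grad(state, action, theta):
    # Gradient of log N(a; features(s) @ theta, sigma^2 I) with respect to theta.
    phi = features(state)
    mean = phi @ theta
    return np.outer(phi, action - mean) / sigma**2

# Online loop: the time-varying reward r_t is revealed only after acting,
# and theta is updated by a single stochastic gradient step per round.
state = rng.standard_normal(feature_dim)
for t in range(1, 1001):
    action = sample_action(state, theta)
    # Placeholder time-varying reward; in the online MDP setting the
    # environment supplies r_t(s, a) after the action is chosen.
    reward = -np.sum((action - np.sin(0.01 * t)) ** 2)
    step_size = 1.0 / np.sqrt(t)  # decaying steps of this order are typical in O(√T)-regret analyses
    theta += step_size * reward * log_policy_grad(state, action, theta)
    state = rng.standard_normal(feature_dim)  # stand-in for the MDP transition

The single gradient step per round is what makes the method "online": no replay of past rewards is needed, which is why a sublinear regret bound against the best fixed policy is the natural performance measure.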

Citation (APA)

Ma, Y., Zhao, T., Hatano, K., & Sugiyama, M. (2014). An online policy gradient algorithm for Markov decision processes with continuous states and actions. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8725 LNAI, pp. 354–369). Springer Verlag. https://doi.org/10.1007/978-3-662-44851-9_23
