Learning curve bounds for a Markov decision process with undiscounted rewards


Abstract

The goal of learning in Markov decision processes is to find a policy that yields the maximum expected return over time. In problems with large state spaces, computing these averages directly is not feasible; instead, the agent must estimate them by stochastic exploration of the state space. Using methods from statistical mechanics, we study how the agent's performance depends on the allowed exploration time. In particular, for a simple control problem with undiscounted rewards, we compute a lower bound on the return of policies that appear optimal based on imperfect statistics. This is done in the thermodynamic limit: T → ∞, N → ∞, with α = T/N held finite, where T is the number of time steps allotted per policy evaluation and N is the size of the state space.
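The estimation procedure the abstract refers to, evaluating a fixed policy from a single T-step trajectory of undiscounted rewards, can be sketched as follows. This is a minimal illustrative Monte Carlo simulation, not the paper's analysis; the function and variable names are hypothetical, and the tiny two-state MDP at the end exists only to exercise the code.

```python
import random

def estimate_return(P, r, policy, T, s0=0, seed=0):
    """Monte Carlo estimate of a policy's average undiscounted reward
    from one trajectory of T steps.

    P[s][a] is a list of next-state probabilities, r[s][a] a reward.
    These interfaces are illustrative assumptions, not from the paper.
    """
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        total += r[s][a]
        # Sample the next state from the distribution P[s][a].
        u, cum = rng.random(), 0.0
        for s_next, p in enumerate(P[s][a]):
            cum += p
            if u < cum:
                s = s_next
                break
    return total / T

# Tiny 2-state, 1-action example: reward 1 in state 0, reward 0 in
# state 1, with uniform transitions, so the true average reward is 0.5.
P = [[[0.5, 0.5]], [[0.5, 0.5]]]
r = [[1.0], [0.0]]
est = estimate_return(P, r, policy=lambda s: 0, T=1000)
```

With T finite the estimate fluctuates around the true value; the paper's setting studies exactly how such fluctuations, at fixed α = T/N, mislead the choice of an apparently optimal policy.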

Saul, L. K., & Singh, S. P. (1996). Learning curve bounds for a Markov decision process with undiscounted rewards. In Proceedings of the Annual ACM Conference on Computational Learning Theory (pp. 147–156). https://doi.org/10.1145/238061.238084
