Abstract
The goal of learning in Markov decision processes is to find a policy that yields the maximum expected return over time. In problems with large state spaces, computing these averages directly is not feasible; instead, the agent must estimate them by stochastic exploration of the state space. Using methods from statistical mechanics, we study how the agent's performance depends on the allowed exploration time. In particular, for a simple control problem with undiscounted rewards, we compute a lower bound on the return of policies that appear optimal based on imperfect statistics. This is done in the thermodynamic limit: T→∞, N→∞, with α = T/N held finite, where T is the number of time steps allotted per policy evaluation and N is the size of the state space.
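The gap the abstract describes — a policy can look best under noisy, finite-sample return estimates yet fall short of the true optimum — can be illustrated with a minimal sketch. This is not the paper's model or derivation; it is a simplified stand-in in which N candidate policies receive T = αN noisy return samples in total, and the agent keeps whichever policy appears best. The function names and the uniform noise model are illustrative assumptions.

```python
import random

def empirical_best_policy(true_means, samples_per_policy, rng):
    """Estimate each policy's undiscounted return from noisy samples and
    return the index of the policy that *appears* best (which may not be
    the truly optimal one)."""
    best_idx, best_est = 0, float("-inf")
    for i, mu in enumerate(true_means):
        # Noisy return samples: true mean mu plus bounded noise
        # (a crude stand-in for stochastic exploration of the state space).
        est = sum(mu + rng.uniform(-1, 1)
                  for _ in range(samples_per_policy)) / samples_per_policy
        if est > best_est:
            best_idx, best_est = i, est
    return best_idx

def mean_shortfall(N, alpha, trials, seed=0):
    """Average shortfall of the apparently-best policy's true return below
    the truly optimal return, with alpha = T / N samples per policy."""
    rng = random.Random(seed)
    samples = max(1, int(alpha))
    total = 0.0
    for _ in range(trials):
        true_means = [rng.uniform(0, 1) for _ in range(N)]
        chosen = empirical_best_policy(true_means, samples, rng)
        total += max(true_means) - true_means[chosen]
    return total / trials
```

Running `mean_shortfall` at small versus large α shows the qualitative effect the paper bounds quantitatively: with few samples per policy the selected policy's true return can sit well below the optimum, and the shortfall shrinks as α grows.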
Citation
Saul, L. K., & Singh, S. P. (1996). Learning curve bounds for a Markov decision process with undiscounted rewards. In Proceedings of the Annual ACM Conference on Computational Learning Theory (pp. 147–156). https://doi.org/10.1145/238061.238084