Abstract
The goal of learning in Markov decision processes is to find a policy that yields the maximum expected return over time. In problems with large state spaces, computing these averages directly is not feasible; instead, the agent must estimate them by stochastic exploration of the state space. Using methods from statistical mechanics, we study how the agent's performance depends on the allowed exploration time. In particular, for a simple control problem with undiscounted rewards, we compute a lower bound on the return of policies that appear optimal based on imperfect statistics. This is done in the thermodynamic limit: T→∞, N→∞, with α = T/N held finite, where T is the number of time steps allotted per policy evaluation and N is the size of the state space.
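The gap the abstract describes — a policy can look best under noisy, finite-sample return estimates yet fall short of the true optimum — can be illustrated with a minimal sketch. This is not the paper's model or derivation; it is a simplified stand-in in which N candidate policies receive T = αN noisy return samples in total, and the agent keeps whichever policy appears best. The function names and the uniform noise model are illustrative assumptions.

```python
import random

def empirical_best_policy(true_means, samples_per_policy, rng):
    """Estimate each policy's undiscounted return from noisy samples and
    return the index of the policy that *appears* best (which may not be
    the truly optimal one)."""
    best_idx, best_est = 0, float("-inf")
    for i, mu in enumerate(true_means):
        # Noisy return samples: true mean mu plus bounded noise
        # (a crude stand-in for stochastic exploration of the state space).
        est = sum(mu + rng.uniform(-1, 1)
                  for _ in range(samples_per_policy)) / samples_per_policy
        if est > best_est:
            best_idx, best_est = i, est
    return best_idx

def mean_shortfall(N, alpha, trials, seed=0):
    """Average shortfall of the apparently-best policy's true return below
    the truly optimal return, with alpha = T / N samples per policy."""
    rng = random.Random(seed)
    samples = max(1, int(alpha))
    total = 0.0
    for _ in range(trials):
        true_means = [rng.uniform(0, 1) for _ in range(N)]
        chosen = empirical_best_policy(true_means, samples, rng)
        total += max(true_means) - true_means[chosen]
    return total / trials
```

Running `mean_shortfall` at small versus large α shows the qualitative effect the paper bounds quantitatively: with few samples per policy the selected policy's true return can sit well below the optimum, and the shortfall shrinks as α grows.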
Citation
Saul, L. K., & Singh, S. P. (1996). Learning curve bounds for a Markov decision process with undiscounted rewards. In Proceedings of the Annual ACM Conference on Computational Learning Theory (pp. 147–156). https://doi.org/10.1145/238061.238084