We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ)/((1−γ)^3 ε^2)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)^3 ε^2)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)^3 ε^2)) on the sample complexity of estimating the optimal action-value function to within ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ, and 1/(1−γ) up to a constant factor. Moreover, both our lower and upper bounds improve on the state of the art in terms of their dependence on 1/(1−γ).
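To make the setting concrete, the following is a minimal sketch of model-based Q-value iteration with a generative model: an empirical MDP is built by drawing transition samples for each state-action pair, and value iteration is then run on that model. The function name `model_based_q_iteration`, the `sample_next_state` oracle, and the constant in the per-pair sample size `m` are illustrative assumptions that only mirror the scaling of the bound; this is not the paper's exact algorithm or constants.

```python
import numpy as np

def model_based_q_iteration(sample_next_state, reward, n_states, n_actions,
                            gamma, epsilon, delta):
    """Sketch: estimate Q* from a generative model (hypothetical interface).

    sample_next_state(s, a) is assumed to return one draw from P(.|s, a);
    reward is a known (n_states, n_actions) array with values in [0, 1].
    """
    N = n_states * n_actions
    # Per-pair sample size mirroring O(log(N/delta) / ((1-gamma)^3 eps^2));
    # the leading constant here is illustrative, not the paper's.
    m = int(np.ceil(np.log(N / delta) / ((1 - gamma) ** 3 * epsilon ** 2)))

    # Build the empirical transition model from m samples per state-action pair.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(m):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / m

    # Run value iteration on the empirical MDP for roughly an effective horizon.
    Q = np.zeros((n_states, n_actions))
    horizon = int(np.ceil(np.log(1.0 / (epsilon * (1 - gamma))) / (1 - gamma)))
    for _ in range(horizon):
        V = Q.max(axis=1)                # greedy value of the current Q estimate
        Q = reward + gamma * (P_hat @ V) # Bellman optimality backup on the model

    return Q, Q.argmax(axis=1)           # estimated Q* and a greedy policy
```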
Gheshlaghi Azar, M., Munos, R., & Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3), 325–349. https://doi.org/10.1007/s10994-013-5368-1