Reinforcement Learning in Finite MDPs: PAC Analysis

  • Alexander L. Strehl
  • Lihong Li
  • Michael L. Littman


We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. Algorithms with this guarantee, called "PAC-MDP" algorithms, include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis of upper and lower bounds yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
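The abstract contrasts model-based R-MAX with model-free Delayed Q-learning. As a rough illustration of what "model-based" means here, the following is a minimal tabular sketch of the R-MAX idea, not the authors' code: a state-action pair is treated as "known" after m visits, every unknown pair is assigned the optimistic value R_max/(1-γ), and the agent plans by value iteration on the empirical model built from known pairs. The class name RMaxModel, its method names, and the iteration limit and tolerance are hypothetical choices for illustration; r_max is assumed to be an upper bound on the one-step reward.

```python
from typing import List


class RMaxModel:
    """Illustrative tabular R-MAX-style learner (a sketch, not the paper's code)."""

    def __init__(self, n_states: int, n_actions: int, m: int,
                 r_max: float, gamma: float) -> None:
        self.n_states = n_states
        self.n_actions = n_actions
        self.m = m            # visits after which (s, a) is treated as "known"
        self.r_max = r_max    # assumed upper bound on the one-step reward
        self.gamma = gamma    # discount factor, 0 <= gamma < 1
        self.count = [[0] * n_actions for _ in range(n_states)]
        self.r_sum = [[0.0] * n_actions for _ in range(n_states)]
        self.next_count = [[[0] * n_states for _ in range(n_actions)]
                           for _ in range(n_states)]

    def known(self, s: int, a: int) -> bool:
        return self.count[s][a] >= self.m

    def update(self, s: int, a: int, r: float, s2: int) -> None:
        # Record experience only until (s, a) becomes known;
        # after that the empirical model for this pair is frozen.
        if not self.known(s, a):
            self.count[s][a] += 1
            self.r_sum[s][a] += r
            self.next_count[s][a][s2] += 1

    def plan(self, max_iters: int = 1000, tol: float = 1e-8) -> List[List[float]]:
        # Value of an unknown pair: reward r_max received forever, discounted.
        v_opt = self.r_max / (1.0 - self.gamma)
        q = [[v_opt] * self.n_actions for _ in range(self.n_states)]
        for _ in range(max_iters):
            v = [max(row) for row in q]  # state values from the previous sweep
            delta = 0.0
            for s in range(self.n_states):
                for a in range(self.n_actions):
                    if self.known(s, a):
                        n = self.count[s][a]
                        r_hat = self.r_sum[s][a] / n
                        expected_v = sum(c * v[s2] for s2, c in
                                         enumerate(self.next_count[s][a])) / n
                        new_q = r_hat + self.gamma * expected_v
                    else:
                        new_q = v_opt  # optimism in the face of uncertainty
                    delta = max(delta, abs(new_q - q[s][a]))
                    q[s][a] = new_q
            if delta < tol:
                break
        return q

    def act(self, s: int, q: List[List[float]]) -> int:
        # Greedy action with respect to the optimistic Q-values.
        return max(range(self.n_actions), key=lambda a: q[s][a])
```

For example, with two states and two actions, once action 0 in state 0 has been tried m times (always yielding reward 0 and returning to state 0), its value drops below the optimistic value of the untried action 1, so the greedy policy picks action 1 and keeps exploring. A model-free learner such as Delayed Q-learning, by contrast, keeps Q-value estimates directly and does not store the transition counts used above.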

Author-supplied keywords

  • exploration
  • Markov decision processes
  • PAC-MDP
  • reinforcement learning
  • sample
