Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

András Antos; Csaba Szepesvári; Rémi Munos

Journal ArticleOPEN ACCESS

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Machine Learning (2008) 71(1) 89-129

DOI: 10.1007/s10994-007-5038-2

345Citations

124Readers

Abstract

In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory. © 2007 Springer Science+Business Media, LLC.

Author supplied keywords

Cite

CITATION STYLE

APA

Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129. https://doi.org/10.1007/s10994-007-5038-2

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Abstract

Author supplied keywords

Cite

Register to see more suggestions