Empirical Q-Value Iteration


Abstract

We propose a new, simple, and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not rely on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence, i.e., a nonasymptotic sample complexity bound, and show that an asynchronous (or online) version of the algorithm also works. Preliminary experimental results suggest that our algorithm converges faster to a ballpark estimate than stochastic approximation-based algorithms.
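As a rough illustration of the idea described above, the following Python sketch replaces the expectation in the Bellman optimality operator with an empirical average over a fresh batch of simulator samples at every iteration, rather than taking stochastic-approximation steps with decaying step sizes. The names `sample_next_state` and `cost`, the batch size, and the discount factor are illustrative placeholders, not details taken from the paper.

```python
import numpy as np

def empirical_q_value_iteration(sample_next_state, cost, n_states, n_actions,
                                gamma=0.95, n_samples=50, n_iters=200):
    """Hedged sketch of empirical Q-value iteration for a discounted-cost MDP.

    sample_next_state(s, a) -> int : draws a next state from P(. | s, a)
    cost(s, a) -> float            : one-stage cost
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Draw a fresh batch of next states X_1, ..., X_n ~ P(. | s, a).
                xs = [sample_next_state(s, a) for _ in range(n_samples)]
                # Empirical Bellman update: one-stage cost plus the discounted
                # average of the minimal Q-value at the sampled next states.
                Q_next[s, a] = cost(s, a) + gamma * np.mean([Q[x].min() for x in xs])
        Q = Q_next
    return Q
```

In this sketch, each iteration is a full batch update of all state-action pairs; an asynchronous version, as mentioned in the abstract, would instead update one (or a few) state-action pairs per step.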

Citation (APA)

Kalathil, D., Borkar, V. S., & Jain, R. (2021). Empirical Q-Value Iteration. Stochastic Systems, 11(1), 1–18. https://doi.org/10.1287/stsy.2019.0062
