Monte Carlo Bias Correction in Q-Learning


Abstract

The Q-learning algorithm suffers from overestimation bias due to the maximum operator appearing in its update rule. Popular variants of Q-learning, such as double Q-learning, can on the other hand underestimate the action values. In many stochastic environments, both underestimation and overestimation can lead to sub-optimal strategies. In this paper, we present a variation of Q-learning that uses elements from Monte Carlo reinforcement learning to correct for the overestimation bias. Our method (1) makes no assumptions about the distributions of the action values or the rewards, (2) does not require extensive hyperparameter tuning, unlike other popular variants proposed to deal with the overestimation bias, and (3) requires storing only two estimators, similar to double Q-learning, along with the most recent episode. Our method is shown to effectively control for the overestimation bias in a number of simulated stochastic environments, leading to better policies with higher cumulative rewards and action values that are closer to the optimal ones, as compared to a number of well-established approaches.
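For context, the bias the abstract refers to stems from the maximum operator in the standard tabular Q-learning update, while double Q-learning decouples action selection from action evaluation using two estimators. The updates below are standard background (Watkins' Q-learning and van Hasselt's double Q-learning), sketched here for reference; they are not the Monte Carlo correction introduced in this paper.

Q-learning update:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Double Q-learning update (updating Q^A; the roles of Q^A and Q^B are swapped with probability 1/2):
Q^A(s_t, a_t) \leftarrow Q^A(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q^B\!\left(s_{t+1}, \arg\max_{a} Q^A(s_{t+1}, a)\right) - Q^A(s_t, a_t) \right]

Because \max_a Q(s_{t+1}, a) maximizes over the same noisy estimates it then evaluates, its expectation exceeds the maximum of the true action values, which produces the overestimation; double Q-learning's cross-evaluation removes this effect but can bias the estimates downward instead.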


Citation (APA)

Papadimitriou, D. (2023). Monte Carlo Bias Correction in Q-Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13539 LNAI, pp. 343–352). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19907-3_33
