In Q-learning, a reduced chance of converging to the optimal policy is partly caused by bias in the estimated action values. Estimating action values typically introduces overestimation or underestimation bias, which harms the current policy. The values produced by the maximization operator are overestimated, a phenomenon known as maximization bias. To correct this bias, the double-estimators operator reduces the values, pushing them toward underestimation. However, according to the proposed analysis, the performance of the two operators (the maximization operator and the double-estimators operator) depends on the unknown dynamics of the environment, in which the estimation bias arises not only from the difference between the current policy and the optimal policy but also from the sampling error of the reward. This sampling error, which both operators amplify, creates a risk of converging to a non-optimal policy. To reduce this risk, this paper proposes a flexible operator, named the Risk Aversion operator and inspired by humans' response to uncertainty, which uses the value of the most visited action instead of the greedy value. Based on this operator, Risk Aversion Q-learning is proposed, and the boundedness of its action values and its convergence are proven. In three demonstration tasks whose optimal policy is known, the proposed algorithm increases the chance of converging to the optimal policy.
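To make the contrast between the operators concrete, below is a minimal tabular sketch in which the bootstrap value for the next state comes either from the greedy maximum (standard Q-learning) or from the most visited action, as a stand-in for the Risk Aversion operator described in the abstract. The function name, parameters, and update form are illustrative assumptions; the exact operator in Wang et al. (2020) may weight or combine these quantities differently.

```python
import numpy as np

def q_learning_step(Q, N, s, a, r, s_next, alpha=0.1, gamma=0.99,
                    operator="risk_aversion"):
    """One tabular update of Q[s, a]; N counts state-action visits.

    Illustrative sketch only: 'risk_aversion' here means bootstrapping
    from the value of the most visited action in the next state instead
    of the greedy maximum. All details are assumptions, not the paper's
    exact formulation.
    """
    if operator == "max":                      # standard Q-learning target
        bootstrap = np.max(Q[s_next])
    elif operator == "risk_aversion":          # value of most visited action
        bootstrap = Q[s_next, np.argmax(N[s_next])]
    else:
        raise ValueError(f"unknown operator: {operator}")

    N[s, a] += 1
    Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    return Q, N

# Tiny usage example with 3 states and 2 actions.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
N = np.zeros((n_states, n_actions), dtype=int)
Q, N = q_learning_step(Q, N, s=0, a=1, r=1.0, s_next=2)
```

The intent of such an operator, as the abstract describes it, is to avoid chasing a greedy value that may be inflated by sampling error, preferring instead a value supported by more visits.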
Wang, B., Li, X., Gao, Z., & Zhong, Y. (2020). Risk Aversion Operator for Addressing Maximization Bias in Q-Learning. IEEE Access, 8, 43098–43110. https://doi.org/10.1109/ACCESS.2020.2977400