Estimating the probability distribution of returns enables sophisticated decision-making schemes for control problems in Markov environments, including risk-sensitive control and efficient exploration. Many reinforcement learning algorithms, however, rely solely on the expected return. This paper presents a decision-making scheme that uses both the mean and the variance of return distributions: a TD algorithm for estimating the variance of return in Markov decision process (MDP) environments, and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a standard criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and validate the criterion for risk-avoiding sequential decision tasks.
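To make the two ingredients concrete, below is a minimal tabular sketch, not the paper's own implementation. It assumes a Gymnasium-style discrete environment interface; the function name td_mean_variance, the step sizes alpha/beta, and the risk weight kappa are illustrative choices. It pairs an ordinary TD(0) update for the mean return M(s) with a TD-style update for the variance W(s), using the Bellman-like decomposition Var[R | s] = E[delta^2] + gamma^2 * Var[R | s'], and scores states with the variance-penalized value M(s) - kappa * W(s).

```python
import numpy as np

def td_mean_variance(env, num_episodes=500, gamma=0.95,
                     alpha=0.1, beta=0.1, kappa=0.5):
    """Tabular TD estimation of the mean M(s) and variance W(s) of return
    under a fixed policy, plus the variance-penalized value M - kappa * W.

    Assumes `env` follows a Gymnasium-style reset()/step() protocol with
    discrete observation and action spaces (hypothetical interface).
    """
    n_states = env.observation_space.n
    M = np.zeros(n_states)   # estimate of E[return | s]
    W = np.zeros(n_states)   # estimate of Var[return | s]

    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()  # evaluate a fixed (here: random) policy
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Ordinary TD(0) error for the mean return.
            delta = r + (0.0 if done else gamma * M[s_next]) - M[s]
            M[s] += alpha * delta

            # TD-style update for the variance, from the decomposition
            #   Var[R | s] = E[delta^2] + gamma^2 * Var[R | s'].
            w_target = delta ** 2 + (0.0 if done else gamma ** 2 * W[s_next])
            W[s] += beta * (w_target - W[s])

            s = s_next

    # Variance-penalized values: larger kappa means stronger risk avoidance.
    return M - kappa * W, M, W
```

A risk-avoiding controller would then prefer actions leading to states with high M(s) - kappa * W(s) rather than high M(s) alone, which is the intuition behind the variance-penalized criterion.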
Sato, M., Kimura, H., & Kobayashi, S. (2001). TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3), 353–362. https://doi.org/10.1527/tjsai.16.353