In this paper, we regard the sequence of returns as outputs from a parametric compound source. The coding rate of the source represents the amount of information carried by the returns, so the information gain with respect to future information is given by the sum of the discounted coding rates. We accordingly formulate a temporal difference learning method for estimating the expected information gain, and prove that the estimate converges under certain conditions. As an example application, we propose using the ratio w of return loss to information gain in probabilistic action selection strategies. Experiments show that our w-based strategy performs well compared with the conventional Q-based strategy. © Springer-Verlag 2003.
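The idea of estimating a discounted sum by temporal difference learning can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the coding-rate signal is replaced by a placeholder scalar reward `r`, the state space, transitions, `alpha`, and `gamma` are all hypothetical, and `softmax_select` stands in generically for a probabilistic action selection rule driven by per-action scores such as the proposed ratio w.

```python
import math
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V[s] toward the target r + gamma * V[s_next].
    Here r is a stand-in for the per-step signal (e.g., a coding rate)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

def softmax_select(scores, temperature=1.0):
    """Probabilistic action selection over per-action scores
    (e.g., hypothetical w-style ratios); higher score means
    a higher probability of being chosen."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    r = random.random()
    cum = 0.0
    for a, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return a
    return len(exps) - 1

# Toy two-state chain: state 0 yields signal 1.0 and moves to state 1;
# state 1 yields 0.0 and moves back to state 0.
V = {0: 0.0, 1: 0.0}
for _ in range(1000):
    td0_update(V, 0, 1.0, 1)
    td0_update(V, 1, 0.0, 0)

# Fixed point: V[0] = 1 + 0.9 * V[1], V[1] = 0.9 * V[0],
# so V[0] = 1 / (1 - 0.81) ~ 5.26 and V[1] ~ 4.74.
```

The same update shape applies whatever the per-step signal is; estimating the expected information gain amounts to running this recursion on the coding rates rather than on rewards.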
CITATION STYLE
Iwata, K., & Ikeda, K. (2004). Temporal Difference Coding in Reinforcement Learning. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2690, 218–227. https://doi.org/10.1007/978-3-540-45080-1_30