Q(λ) with off-policy corrections

Abstract

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and formalize an underlying tradeoff in off-policy TD(λ). We illustrate this theoretical relationship empirically on a continuous-state control task.
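The core idea described above, correcting off-policy returns with the current Q-function rather than with importance-sampling ratios on the target policy, can be illustrated with a small tabular sketch. The code below is an assumption-laden illustration, not the paper's pseudocode: the function name, the tabular offline per-episode form, and the default hyperparameter values are ours. It computes, for every visited state-action pair, the (gamma * lambda)-discounted sum of corrected TD errors delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t) along a trajectory generated by the behavior policy.

```python
import numpy as np

def q_lambda_off_policy_update(Q, episode, target_policy,
                               gamma=0.99, lam=0.7, alpha=0.1):
    """Illustrative offline Q(lambda)-style update from one behavior-policy episode.

    Q             : (n_states, n_actions) array of action values, updated in place
    episode       : list of (state, action, reward, next_state) tuples
    target_policy : (n_states, n_actions) array giving pi(a | s)

    The per-step correction uses the expected Q-value under the target policy
    (a correction "in terms of rewards"), not importance-sampling ratios on
    transition probabilities.
    """
    T = len(episode)

    # Corrected TD errors: delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)
    deltas = np.empty(T)
    for t, (s, a, r, s_next) in enumerate(episode):
        expected_q_next = np.dot(target_policy[s_next], Q[s_next])
        deltas[t] = r + gamma * expected_q_next - Q[s, a]

    # Forward view: each visited pair accumulates the (gamma * lambda)-discounted
    # sum of the corrected TD errors that follow it.
    for t, (s, a, _, _) in enumerate(episode):
        correction = sum((gamma * lam) ** (k - t) * deltas[k] for k in range(t, T))
        Q[s, a] += alpha * correction
    return Q
```

As the abstract notes, how large lambda can safely be in such an update depends on the discount factor and on how far the behavior policy is from the target policy; the paper's conditions make this tradeoff precise.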

Citation (APA)

Harutyunyan, A., Bellemare, M. G., Stepleton, T., & Munos, R. (2016). Q(λ) with off-policy corrections. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9925 LNAI, pp. 305–320). Springer Verlag. https://doi.org/10.1007/978-3-319-46379-7_21
