Technical update: Least-squares temporal difference learning

TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22(1–3), 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
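
For concreteness, here is a minimal sketch of LSTD(λ) with linear features, written in Python under the standard formulation: the estimator accumulates a k×k matrix A and a k-vector b from eligibility-trace-weighted transitions, then solves A w = b for the value-function weights. The names (lstd_lambda, phi, trajectories) and the small ridge term eps are illustrative assumptions, not the paper's notation.

    import numpy as np

    def lstd_lambda(trajectories, phi, lam=0.0, gamma=1.0, eps=1e-6):
        # trajectories: list of episodes; each episode is a list of
        # (s, r, s_next, done) transitions. phi maps a state to a
        # length-k NumPy feature vector.
        k = len(phi(trajectories[0][0][0]))
        A = eps * np.eye(k)   # small ridge term keeps A invertible (an assumption)
        b = np.zeros(k)
        for episode in trajectories:
            z = np.zeros(k)   # eligibility trace
            for (s, r, s_next, done) in episode:
                z = gamma * lam * z + phi(s)   # decay trace, add current features
                next_phi = np.zeros(k) if done else phi(s_next)
                A += np.outer(z, phi(s) - gamma * next_phi)
                b += z * r
        return np.linalg.solve(A, b)   # weights w with V(s) ≈ w · phi(s)

With λ = 0 this reduces to Bradtke and Barto's original LSTD. With λ = 1 and γ = 1, the per-episode sums telescope so that A = Σ φ(s_t)φ(s_t)ᵀ and b = Σ φ(s_t)R_t (with R_t the return from step t), i.e. the normal equations of linear regression on complete returns, which is the sense in which the λ = 1 case matches supervised linear regression.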

Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2–3), 233–246.
