This artice is free to access.
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine learning, 22:1-3,33-57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2–3), 233–246. https://doi.org/10.1023/A:1017936530646