Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
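As a rough illustration of the kind of weight update at issue, the sketch below performs one SARSA(0) step. It assumes a linear approximator Q(s, a) = w · φ(s, a), and the function name, the parameter names, and the default step size and discount factor are all hypothetical choices made for this example, not taken from the paper.

```python
from typing import List, Sequence


def sarsa0_update(
    weights: Sequence[float],
    phi_sa: Sequence[float],
    reward: float,
    phi_next: Sequence[float],
    alpha: float = 0.1,
    gamma: float = 0.9,
    terminal: bool = False,
) -> List[float]:
    """Return the weights after one SARSA(0) step (illustrative sketch).

    Assumes a linear approximator Q(s, a) = w . phi(s, a).
    phi_sa holds the features of the current state-action pair;
    phi_next holds the features of the next state paired with the
    action the current policy actually selected there (on-policy).
    """
    # Current estimate of the value of the visited state-action pair.
    q_sa = sum(w * x for w, x in zip(weights, phi_sa))
    # Bootstrapped estimate for the successor pair; zero at episode end.
    q_next = 0.0 if terminal else sum(w * x for w, x in zip(weights, phi_next))
    # One-step temporal-difference error.
    td_error = reward + gamma * q_next - q_sa
    # For a linear approximator the gradient of Q with respect to w is phi_sa.
    return [w + alpha * td_error * x for w, x in zip(weights, phi_sa)]


# Hypothetical two-feature example, starting from zero weights.
w = sarsa0_update([0.0, 0.0], [1.0, 0.0], reward=1.0, phi_next=[0.0, 1.0])
```

Because the next action is chosen by a policy that itself depends on the weights, repeated updates of this form can keep moving the weights around, which is the oscillation the abstract refers to.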
Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In NIPS (pp. 1040–1046). https://doi.org/10.1016/j.ccr.2007.10.019