In reinforcement learning an autonomous agent learns an optimal policy while interacting with the environment. In particular, in one-step Q-learning, an agent updates its Q-values after each action, considering immediate rewards. In this paper a new strategy for updating Q-values is proposed. The strategy, implemented in an algorithm called DQL, uses a set of agents all searching for the same goal in the same space to obtain the same optimal policy. Each agent leaves traces over a copy of the environment (copies of Q-values) while searching for a goal. These copies are used by the agents to decide which actions to take. Once all the agents reach a goal, the original Q-values of the best solution found by all the agents are updated using Watkins' Q-learning formula. DQL has some similarities with Gambardella's Ant-Q algorithm [4]; however, it does not require the definition of a domain-dependent heuristic and consequently avoids the tuning of additional parameters. DQL also does not update the original Q-values with zero reward while the agents are searching, as Ant-Q does. It is shown how DQL's guided exploration by several agents with selected exploitation (updating only the best solution) produces faster convergence times than Q-learning and Ant-Q on several testbed problems under similar conditions.
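The following is a minimal Python sketch of the updating strategy described in the abstract, written under several assumptions that are not taken from the paper: a hypothetical 4x4 grid-world environment, an epsilon-greedy action rule, a simple additive trace left on the working copy of the Q-values, and "best solution" taken to mean the shortest trajectory. It is meant only to illustrate the idea of many agents exploring on copies while only the best solution updates the original Q-values with Watkins' rule, Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)], and is not the authors' implementation.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
GOAL = (3, 3)  # hypothetical goal cell in a 4x4 grid world

def step(state, action):
    """Deterministic grid-world transition (illustrative environment only)."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    nxt = (min(max(x + dx, 0), 3), min(max(y + dy, 0), 3))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def epsilon_greedy(q, state, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def run_agent(q_copy, start=(0, 0), max_steps=100):
    """One agent searches for the goal, leaving traces on the shared copy
    of the Q-values; the original Q-values are never touched here."""
    trajectory, state = [], start
    for _ in range(max_steps):
        action = epsilon_greedy(q_copy, state)
        nxt, reward = step(state, action)
        trajectory.append((state, action, reward, nxt))
        # Trace mechanism is an assumption: a small bonus on the copy so
        # later agents are biased toward (or can be biased away from) it.
        q_copy[(state, action)] = q_copy.get((state, action), 0.0) + 0.1
        state = nxt
        if state == GOAL:
            break
    return trajectory

def dql_episode(q, n_agents=5, alpha=0.1, gamma=0.9):
    """All agents search over one working copy; only the best (here: the
    shortest) solution updates the original q with Watkins' formula."""
    working_copy = dict(q)  # traces accumulate here, not on q
    solutions = [run_agent(working_copy) for _ in range(n_agents)]
    best = min(solutions, key=len)
    for state, action, reward, nxt in best:
        max_next = max(q.get((nxt, a), 0.0) for a in ACTIONS)
        td_error = reward + gamma * max_next - q.get((state, action), 0.0)
        q[(state, action)] = q.get((state, action), 0.0) + alpha * td_error
    return q

if __name__ == "__main__":
    q_values = {}
    for _ in range(50):
        dql_episode(q_values)
    # Show the highest-valued state-action pairs found so far.
    print(sorted(q_values.items(), key=lambda kv: -kv[1])[:5])
```

In this sketch the selective exploitation is the key point: exploration by many agents only biases the working copy, and a single Watkins-style update per episode is applied to the original Q-values along the best trajectory.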
CITATION STYLE
Mariano, C. E., & Morales, E. F. (2001). DQL: A new updating strategy for reinforcement learning based on Q-learning. In Lecture Notes in Computer Science (Vol. 2167, pp. 324–335). Springer. https://doi.org/10.1007/3-540-44795-4_28