Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Abstract

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample efficiency without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and unusually robust to its hyperparameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR on discrete, continuous, and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Appendix: https://arxiv.org/abs/1903.04193.
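For intuition, here is a minimal sketch of the actor update described in the abstract: the actor's action distribution for one state takes a small step toward the average greedy policy of the critics. This is an illustrative assumption of the rule, not the paper's exact implementation (see the linked source code for that); the helper names and the step size "lam" are hypothetical.

import numpy as np

def greedy(q_values):
    # One-hot distribution over the argmax action of one critic's Q-values.
    probs = np.zeros_like(q_values, dtype=float)
    probs[np.argmax(q_values)] = 1.0
    return probs

def actor_update(pi_s, critic_qs, lam=0.05):
    # Move the actor's distribution for one state a small step (lam)
    # toward the average greedy policy of the off-policy critics.
    avg_greedy = np.mean([greedy(q) for q in critic_qs], axis=0)
    return (1.0 - lam) * pi_s + lam * avg_greedy

# Example: 4 actions, 16 critics with arbitrary Q-values for one state.
pi_s = np.ones(4) / 4.0
critic_qs = [np.random.randn(4) for _ in range(16)]
pi_s = actor_update(pi_s, critic_qs)

Because each critic is trained off-policy from a replay buffer, this slow imitation decouples the actor from the critics, which is what the abstract credits for BDPI's stability and Thompson-sampling-like exploration.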

Citation (APA)

Steckelmacher, D., Plisnier, H., Roijers, D. M., & Nowé, A. (2020). Sample-efficient model-free reinforcement learning with off-policy critics. In Lecture Notes in Computer Science (Vol. 11908 LNAI, pp. 19–34). Springer. https://doi.org/10.1007/978-3-030-46133-1_2
