Policy gradient critics

Abstract

We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memories of past observations and actions. The approach involves estimating a policy gradient for an Actor through a Policy Gradient Critic which evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches to learning purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general purpose POMDP algorithm that solves RL problems in both continuous and discrete action domains.
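
The abstract describes, at a high level, a recurrent stochastic policy whose memory state summarizes the observation-action history, paired with a critic defined over action probability distributions. The sketch below is only an illustration of that general idea in PyTorch: a generic recurrent actor-critic on a toy memory task. It is not the paper's PGAC algorithm, and the "recall the first cue" environment, network sizes, and hyperparameters are all assumptions made for this example.

```python
# Illustrative sketch only: a generic recurrent actor-critic on a toy memory task.
# It is NOT the paper's exact PGAC algorithm; the environment ("recall the first cue"),
# network sizes, and hyperparameters are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

OBS_DIM, N_ACTIONS, HIDDEN = 4, 3, 32

class RecurrentActor(nn.Module):
    """Maps the observation-action history (summarized by a GRU memory) to a
    probability distribution over actions."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(OBS_DIM + N_ACTIONS, HIDDEN)
        self.head = nn.Linear(HIDDEN, N_ACTIONS)

    def step(self, obs, prev_action_onehot, h):
        h = self.rnn(torch.cat([obs, prev_action_onehot], dim=-1), h)
        return Categorical(logits=self.head(h)), h

class DistributionCritic(nn.Module):
    """Scores a (memory state, action distribution) pair rather than a single action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIDDEN + N_ACTIONS, HIDDEN), nn.Tanh(), nn.Linear(HIDDEN, 1))

    def forward(self, h, action_probs):
        return self.net(torch.cat([h, action_probs], dim=-1)).squeeze(-1)

def run_episode(actor, critic, T=5):
    """Toy POMDP: a cue is shown only at t=0; reward is given for choosing the
    matching action at the final step, so the policy must remember the cue."""
    cue = torch.randint(N_ACTIONS, ())
    h = torch.zeros(1, HIDDEN)
    prev_a = torch.zeros(1, N_ACTIONS)
    log_probs, values, rewards = [], [], []
    for t in range(T):
        obs = torch.zeros(1, OBS_DIM)
        if t == 0:
            obs[0, cue] = 1.0          # cue is only observable at the first step
        dist, h = actor.step(obs, prev_a, h)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        values.append(critic(h, dist.probs))
        rewards.append(1.0 if (t == T - 1 and a.item() == cue.item()) else 0.0)
        prev_a = F.one_hot(a, N_ACTIONS).float()
    return log_probs, values, rewards

def train(episodes=2000, gamma=0.99):
    actor, critic = RecurrentActor(), DistributionCritic()
    opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
    for _ in range(episodes):
        log_probs, values, rewards = run_episode(actor, critic)
        returns, G = [], 0.0
        for r in reversed(rewards):          # discounted Monte Carlo returns
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        values = torch.cat(values)
        advantage = returns - values.detach()
        actor_loss = -(torch.cat(log_probs) * advantage).sum()   # REINFORCE with baseline
        critic_loss = F.mse_loss(values, returns)                # fit critic to returns
        opt.zero_grad()
        (actor_loss + critic_loss).backward()
        opt.step()
    return actor, critic

if __name__ == "__main__":
    train()
```

Note that PGAC, as described in the abstract, estimates the actor's gradient through a critic that evaluates action distributions; the sketch above uses a standard REINFORCE-with-baseline update only to make the memory-plus-stochasticity idea concrete.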

Citation (APA)

Wierstra, D., & Schmidhuber, J. (2007). Policy gradient critics. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4701 LNAI, pp. 466–477). Springer Verlag. https://doi.org/10.1007/978-3-540-74958-5_43
