Stochastic policy gradient reinforcement learning on a simple 3D biped

  • Tedrake R
  • Zhang T
  • Seung H

We present a learning system that quickly and reliably acquires a robust feedback control policy for 3D dynamic walking from a blank slate, using only trials implemented on our physical robot. The robot begins walking within a minute, and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system in the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot can continually adapt to the terrain as it walks.
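
The idea of a stochastic policy gradient update with a state-based estimate of the expected cost can be illustrated on a toy problem. The sketch below is not the authors' controller or robot model. It assumes a hypothetical scalar return map `x_next = A*x + B*u + noise`, a linear policy `u = w*x`, and a baseline of the form `c * x**2`. All names, constants, and learning rates are illustrative assumptions. The policy parameter is perturbed with Gaussian noise on every step, and the update correlates that perturbation with the difference between the observed cost and the baseline.

```python
import random

# Hypothetical toy return map (NOT the robot's dynamics):
#   x_next = A * x + B * u + noise
A, B = 0.9, 1.0
SIGMA = 0.1   # std of the exploration noise added to the policy parameter (assumed)
ETA_W = 0.02  # policy learning rate (assumed)
ETA_V = 0.5   # baseline learning rate (assumed)


def train(steps=3000, seed=0):
    """Return (w, c): the learned feedback gain and baseline coefficient."""
    rng = random.Random(seed)
    w = 0.0  # policy parameter, u = w * x, starting from a blank slate
    c = 0.0  # baseline parameter, expected cost estimated as c * x**2
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)   # state at the start of a step
        z = rng.gauss(0.0, SIGMA)    # random perturbation of w
        u = (w + z) * x              # stochastic policy
        x_next = A * x + B * u + rng.gauss(0.0, 0.01)
        cost = x_next ** 2           # one-step cost: squared distance from 0
        baseline = c * x ** 2        # state-based estimate of the expected cost
        # Correlate the cost "surprise" with the perturbation; subtracting
        # the baseline lowers the variance of this update.
        w -= ETA_W * (cost - baseline) * z / SIGMA ** 2
        # LMS step that moves the baseline toward the observed cost.
        c += ETA_V * (cost - baseline) * x ** 2
    return w, c


if __name__ == "__main__":
    w, c = train()
    print("learned gain:", w, "closed-loop multiplier:", A + B * w)
```

In this toy setting the closed-loop multiplier `A + B * w` should shrink toward zero as learning proceeds. Because the baseline depends only on the state and not on the perturbation, subtracting it leaves the expected update direction unchanged, which is why it can reduce variance without biasing the gradient estimate.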
