
Apprenticeship learning via inverse reinforcement learning

by Pieter Abbeel, Andrew Y Ng
Twenty-first International Conference on Machine Learning (ICML '04), 2004

Abstract

This paper formulates apprenticeship learning as learning from an expert whose reward function is unknown. The reward function is assumed to be a linear combination of known features, and inverse RL is used to try to recover it from the expert's demonstrations. The apprentice is modeled as a finite-state MDP, and the algorithm learns from the features of the trajectories demonstrated by the expert. This enables the agent to mimic different driving styles in a driving simulator.


Available from portal.acm.org

Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel pabbeel@cs.stanford.edu
Andrew Y. Ng ang@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305, USA
Abstract
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using “inverse reinforcement learning” to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert’s reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert’s unknown reward function.
1. Introduction
Given a sequential decision making problem posed in the Markov decision process (MDP) formalism, a number of standard algorithms exist for finding an optimal or near-optimal policy. In the MDP setting, we typically assume that a reward function is given. Given a reward function and the MDP’s state transition probabilities, the value function and optimal policy are exactly determined.
The MDP formalism is useful for many problems because it is often easier to specify the reward function than to directly specify the value function (and/or optimal policy). However, we believe that even the reward function is frequently difficult to specify manually. Consider, for example, the task of highway driving. When driving, we typically trade off many different desiderata, such as maintaining safe following distance, keeping away from the curb, staying far from any pedestrians, maintaining a reasonable speed, perhaps a slight preference for driving in the middle lane, not changing lanes too often, and so on. To specify a reward function for the driving task, we would have to assign a set of weights stating exactly how we would like to trade off these different factors. Despite being able to drive competently, the authors do not believe they can confidently specify a specific reward function for the task of “driving well.”¹
Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
In practice, this means that the reward function is often manually tweaked (cf. reward shaping, Ng et al., 1999) until the desired behavior is obtained. From conversations with engineers in industry and our own experience in applying reinforcement learning algorithms to several robots, we believe that, for many problems, the difficulty of manually specifying a reward function represents a significant barrier to the broader applicability of reinforcement learning and optimal control algorithms.
When teaching a young adult to drive, rather than telling them what the reward function is, it is much easier and more natural to demonstrate driving to them, and have them learn from the demonstration. The task of learning from an expert is called apprenticeship learning (also learning by watching, imitation learning, or learning from demonstration).
A number of approaches have been proposed for apprenticeship learning in various applications. Most of these methods try to directly mimic the demonstrator by applying a supervised learning algorithm to learn a direct mapping from the states to the actions. This literature is too wide to survey here, but some examples include Sammut et al. (1992); Kuniyoshi et al. (1994); Demiris & Hayes (1994); Amit & Mataric (2002); Pomerleau (1989). One notable exception is given in Atkeson & Schaal (1997). They considered the problem of having a robot arm follow a demonstrated trajectory, and used a reward function that quadratically penalizes deviation from the desired trajectory. Note, however, that this method is applicable only to problems where the task is to mimic the expert’s trajectory. For highway driving, blindly following the expert’s trajectory would not work, because the pattern of traffic encountered is different each time.
¹ We note that this is true even though the reward function may often be easy to state in English. For instance, the “true” reward function that we are trying to maximize when driving is, perhaps, our “personal happiness.” The practical problem, however, is how to model this (i.e., our happiness) explicitly as a function of the problem’s states, so that a reinforcement learning algorithm can be applied.
Given that the entire field of reinforcement learning is founded on the presupposition that the reward function, rather than the policy or the value function, is the most succinct, robust, and transferable definition of the task, it seems natural to consider an approach to apprenticeship learning whereby the reward function is learned.²
The problem of deriving a reward function from observed behavior is referred to as inverse reinforcement learning (Ng & Russell, 2000). In this paper, we assume that the expert is trying (without necessarily succeeding) to optimize an unknown reward function that can be expressed as a linear combination of known “features.” Even though we cannot guarantee that our algorithms will correctly recover the expert’s true reward function, we show that our algorithm will nonetheless find a policy that performs as well as the expert, where performance is measured with respect to the expert’s unknown reward function.
2. Preliminaries
A (finite-state) Markov decision process (MDP) is a tuple $(S, A, T, \gamma, D, R)$, where $S$ is a finite set of states; $A$ is a set of actions; $T = \{P_{sa}\}$ is a set of state transition probabilities (here, $P_{sa}$ is the state transition distribution upon taking action $a$ in state $s$); $\gamma \in [0, 1)$ is a discount factor; $D$ is the initial-state distribution, from which the start state $s_0$ is drawn; and $R : S \to \mathbb{R}$ is the reward function, which we assume to be bounded in absolute value by 1. We let MDP\R denote an MDP without a reward function, i.e., a tuple of the form $(S, A, T, \gamma, D)$.
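The tuple defined above can be written down directly as a small data structure. The following is a minimal sketch, not code from the paper: the class name `MDPNoReward`, its field names, and the two-state toy example are all hypothetical, chosen only to mirror the symbols $(S, A, T, \gamma, D)$.

```python
import random
from dataclasses import dataclass


@dataclass
class MDPNoReward:
    """An MDP without a reward function: the tuple (S, A, T, gamma, D)."""
    states: list        # S: finite set of states
    actions: list       # A: set of actions
    transitions: dict   # T: (s, a) -> {next_state: probability}, i.e. P_sa
    gamma: float        # discount factor, must lie in [0, 1)
    initial: dict       # D: state -> probability of being the start state

    def validate(self):
        """Check the constraints stated in the definition."""
        assert 0.0 <= self.gamma < 1.0
        assert abs(sum(self.initial.values()) - 1.0) < 1e-9
        for s in self.states:
            for a in self.actions:
                assert abs(sum(self.transitions[(s, a)].values()) - 1.0) < 1e-9

    def sample_start(self, rng):
        """Draw s_0 from the initial-state distribution D."""
        states, probs = zip(*self.initial.items())
        return rng.choices(states, weights=probs)[0]

    def step(self, s, a, rng):
        """Draw the next state from P_sa."""
        nxt, probs = zip(*self.transitions[(s, a)].items())
        return rng.choices(nxt, weights=probs)[0]


# Hypothetical two-state, two-action example (not from the paper).
toy = MDPNoReward(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.1, "s1": 0.9},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.9, "s1": 0.1},
    },
    gamma=0.9,
    initial={"s0": 1.0},
)
toy.validate()
```

Keeping the reward out of the structure reflects the setting of the paper: the dynamics are available, while the reward has to be inferred from demonstrations.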
We assume that there is some vector of features $\phi : S \to [0, 1]^k$ over states, and that there is some “true” reward function $R^*(s) = w^* \cdot \phi(s)$, where $w^* \in \mathbb{R}^k$.³ In order to ensure that the rewards are bounded by 1, we also assume $\|w^*\|_1 \le 1$. In the driving domain, $\phi$ might be a vector of features indicating the different desiderata in driving that we would like to trade off, such as whether we have just collided with another car, whether we’re driving in the middle lane, and so on. The (unknown) vector $w^*$ specifies the relative weighting between these desiderata.
² A related idea is also seen in biomechanics and cognitive science, where researchers have pointed out that simple reward functions (usually ones constructed by hand) often suffice to explain complicated behavior (policies). Examples include the minimum jerk principle to explain limb movement in primates (Hogan, 1984), and the minimum torque-change model to explain trajectories in human multijoint arm movement (Uno et al., 1989). Related examples are also found in economics and some other literatures. (See the discussion in Ng & Russell, 2000.)
³ The case of state-action rewards $R(s, a)$ offers no additional difficulties; using features of the form $\phi : S \times A \to [0, 1]^k$, our algorithms still apply straightforwardly.
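The reason the 1-norm assumption on the weights keeps the rewards bounded by 1 can be spelled out in one line. The derivation below is a sketch that uses only the definitions above; the component notation $w^*_i$ and $\phi_i(s)$ is introduced here for illustration.

```latex
\[
|R^*(s)| = \Big| \sum_{i=1}^{k} w^*_i \, \phi_i(s) \Big|
\le \sum_{i=1}^{k} |w^*_i| \, |\phi_i(s)|
\le \sum_{i=1}^{k} |w^*_i|
= \|w^*\|_1 \le 1 ,
\]
```

where the second inequality uses $0 \le \phi_i(s) \le 1$ for every feature and every state.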
A policy $\pi$ is a mapping from states to probability distributions over actions. The value of a policy $\pi$ is
\begin{align}
E_{s_0 \sim D}[V^{\pi}(s_0)] &= E\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big] \tag{1} \\
&= E\Big[\sum_{t=0}^{\infty} \gamma^t \, w \cdot \phi(s_t) \,\Big|\, \pi\Big] \tag{2} \\
&= w \cdot E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big] \tag{3}
\end{align}
Here, the expectation is taken with respect to the random state sequence $s_0, s_1, \ldots$ drawn by starting from a state $s_0 \sim D$, and picking actions according to $\pi$. We define the expected discounted accumulated feature value vector $\mu(\pi)$, or more succinctly the feature expectations, to be
\begin{equation}
\mu(\pi) = E\Big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\Big] \in \mathbb{R}^k. \tag{4}
\end{equation}
Using this notation, the value of a policy may be written $E_{s_0 \sim D}[V^{\pi}(s_0)] = w \cdot \mu(\pi)$. Given that the reward $R$ is expressible as a linear combination of the features $\phi$, the feature expectations for a given policy $\pi$ completely determine the expected sum of discounted rewards for acting according to that policy.
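The identity between the value of a policy and the inner product of the weights with the feature expectations can be checked numerically on sampled state sequences. The sketch below is an illustration under stated assumptions, not the paper's procedure: it replaces the expectation by an average over a finite set of trajectories and truncates the infinite sum at the trajectory length. The function names and the one-hot feature map `phi` are hypothetical.

```python
def discounted_features(states, phi, gamma):
    """Sum over t of gamma^t * phi(s_t) for one observed state sequence."""
    k = len(phi(states[0]))
    total = [0.0] * k
    discount = 1.0
    for s in states:
        f = phi(s)
        for i in range(k):
            total[i] += discount * f[i]
        discount *= gamma
    return total


def estimate_mu(trajectories, phi, gamma):
    """Average the per-trajectory sums: a sample-based estimate of mu(pi)."""
    sums = [discounted_features(t, phi, gamma) for t in trajectories]
    k = len(sums[0])
    return [sum(v[i] for v in sums) / len(sums) for i in range(k)]


def value(w, mu):
    """Value of a policy under the linear reward w . phi, i.e. w . mu(pi)."""
    return sum(wi * mi for wi, mi in zip(w, mu))


# Hypothetical example: two states, one-hot features, gamma = 0.5.
phi = lambda s: [1.0, 0.0] if s == 0 else [0.0, 1.0]
trajectories = [[0, 1, 1], [1, 1, 1]]
w = [0.5, -0.5]

mu_hat = estimate_mu(trajectories, phi, 0.5)
print(mu_hat)            # [0.5, 1.25]
print(value(w, mu_hat))  # -0.375
```

Averaging the discounted reward of each trajectory directly gives the same number as `value(w, mu_hat)`, which is the linearity step from equation (2) to equation (3).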
Let $\Pi$ denote the set of stationary policies for an MDP. Given two policies $\pi_1, \pi_2 \in \Pi$, we can construct a new policy $\pi_3$ by mixing them together. Specifically, imagine that $\pi_3$ operates by flipping a coin with bias $\lambda$, and with probability $\lambda$ picks and always acts according to $\pi_1$, and with probability $1 - \lambda$ always acts according to $\pi_2$. From linearity of expectation, clearly we have that $\mu(\pi_3) = \lambda \mu(\pi_1) + (1 - \lambda) \mu(\pi_2)$. Note that the randomization step selecting between $\pi_1$ and $\pi_2$ occurs only once at the start of a trajectory, and not on every step taken in the MDP. More generally, if we have found some set of policies $\pi_1, \ldots, \pi_d$, and want to find a new policy whose feature expectations vector is a convex combination $\sum_{i=1}^{d} \lambda_i \mu(\pi_i)$ ($\lambda_i \ge 0$, $\sum_i \lambda_i = 1$) of these policies’, then we can do so by mixing together the policies $\pi_1, \ldots, \pi_d$, where the probability of picking $\pi_i$ is given by $\lambda_i$.
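The mixing construction has two parts: computing the convex combination of feature expectations, and drawing a single policy at the start of each trajectory. A minimal sketch of both follows. The function names and the numeric values are hypothetical and serve only to illustrate the construction described above.

```python
import random


def mix_feature_expectations(mus, lambdas):
    """Convex combination sum_i lambda_i * mu(pi_i) of feature expectations."""
    assert all(l >= 0 for l in lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-9
    k = len(mus[0])
    return [sum(l * mu[i] for l, mu in zip(lambdas, mus)) for i in range(k)]


def sample_policy_index(lambdas, rng):
    """Choose which policy to follow for a whole trajectory.

    Call this once at the start of a trajectory, not at every step.
    """
    return rng.choices(range(len(lambdas)), weights=lambdas)[0]


# Hypothetical feature expectations of two policies and mixing weights.
mus = [[1.0, 0.0], [0.0, 2.0]]
lambdas = [0.25, 0.75]
print(mix_feature_expectations(mus, lambdas))  # [0.25, 1.5]

# Empirically, policy 0 is selected for about a quarter of the trajectories.
rng = random.Random(0)
picks = [sample_policy_index(lambdas, rng) for _ in range(10000)]
freq_first = picks.count(0) / len(picks)
```

Selecting the policy once per trajectory is what makes the feature expectations combine linearly; re-drawing the policy at every step would in general define a different stationary policy with different feature expectations.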
We assume access to demonstrations by some expert $\pi_E$. Specifically, we assume the ability to observe trajectories (state sequences) generated by the expert starting from $s_0 \sim D$ and taking actions according to $\pi_E$. It may be helpful to think of $\pi_E$ as the optimal policy under the reward function $R^* = w^{*T} \phi$, though we do not require this to hold.
For our algorithm, we will require an estimate of the expert’s feature expectations $\mu_E = \mu(\pi_E)$. Specifi-
