Continuous-domain reinforcement learning using a learned qualitative state representation
Abstract
We present a method that allows an agent to learn a qualitative state representation that can be applied to reinforcement learning. By exploring the environment, the agent learns an abstraction consisting of landmarks that break the space into qualitative regions, and rules that predict changes in qualitative state. For each predictive rule the agent learns a context consisting of qualitative variables that predicts when the rule will be successful. The regions of this context in which the rule is likely to succeed serve as natural goals for reinforcement learning. The reinforcement learning problems created by the agent are simple because the learned abstraction provides a mapping from the continuous input and motor variables to discrete states that aligns with the dynamics of the environment.
Jonathan Mugan and Benjamin Kuipers
Computer Science Department
University of Texas at Austin
Austin Texas, 78712 USA
{jmugan,kuipers}@cs.utexas.edu
Introduction
Reinforcement learning in continuous domains is difficult because the agent is unable to gain experience at each individual state. This means that the agent must use an abstraction that allows it to map an infinite number of input and motor states into a manageable number of abstracted states. To be useful to the agent, the abstraction must discriminate states that are different, but if the abstraction makes too many unnecessary discriminations then learning becomes inefficient. This balance is often achieved by having a human create and tune the abstraction.
Our approach to this problem is to use a qualitative state representation. In (Mugan & Kuipers 2007a; 2007b), we showed how an agent could build a qualitative representation of its environment that is not specific to any particular goal. The agent does this by breaking the world up into qualitative regions using landmarks and then learning predictive rules over changes in qualitative state.
In our approach, the agent experiences the world through a set of continuous input and motor variables. The motor variables and the derivatives of the input variables have an intrinsic landmark at 0, which creates three qualitative states for each of those variables. The continuous input variables themselves have no intrinsic landmarks, and the agent must learn landmarks on these variables as well as additional landmarks on the motor variables. A change in the qualitative state of a variable defines an event. The agent searches for rules that predict when one event will follow another in time. For each learned predictive rule the agent searches for regions in the state space where that rule will be reliable and delimits those regions by creating new landmarks. Each new landmark defines new events, which make it possible to learn new predictive rules, and so on.
This work has taken place in the Intelligent Robotics Lab at the Artificial Intelligence Laboratory, The University of Texas at Austin. Research of the Intelligent Robotics lab is supported in part by grants from the Texas Advanced Research Program (3658-0170-2007), from the National Science Foundation (IIS-0413257, IIS-0713150, and IIS-0750011), and from the National Institutes of Health (EY016089).
In this paper we show that this learned qualitative representation enables the agent to use reinforcement learning to perform a simple task. The agent defines its own reinforcement learning problems using the learned predictive rules and landmarks. There has been previous work on enabling an agent to define its own reinforcement learning problems. McGovern and Barto (2001) have proposed a method whereby an agent autonomously finds subgoals based on bottleneck states that are visited often during successful trials. Subgoals have also been found by searching for “access states” (Simsek & Barto 2004; Simsek, Wolfe, & Barto 2005) that allow the agent to go from one part of the state space to another. In this paper we use a different approach to identifying goals for reinforcement learning: we define the goals to be the regions, delimited by landmarks, in which a predictive rule is reliable. Additionally, there has been work on learning qualitative values given a model so that the level of abstraction is appropriate for the task (Sachenbacher & Struss 2005). Our work differs because our focus is on learning the model and the abstraction simultaneously.
In the remainder of this paper we first give an overview of the qualitative abstraction and rule learning framework. We then explain how we incorporate reinforcement learning into this framework. Finally, we evaluate the viability of this approach by comparing it to a standard method for reinforcement learning with continuous variables on a simple task.
[Figure 1 diagram. Labels: percept, state, action, Behavior, Landmarks; panels (a) Learn Rules, (b) Learn Context, (c) Learn Landmarks.]
Figure 1: (a) The agent interacts with the world to learn rules stating that when the agent makes one event $E_1$ occur, another event $E_2$ follows. (b) For each rule the agent learns a context $C$ consisting of a set of variables over which it learns a conditional probability table $CPT(r) = \Pr(succeeds(r) \mid C)$ that tells the agent in which situations it can cause $E_2$ by making $E_1$ occur. (c) The agent also learns landmarks that determine how it can perceive and reason about the world. Landmarks are proposed based on the behavior of rules and in turn determine which rules can be learned (Mugan & Kuipers 2007a).
Learning Agent Architecture
An overview of the learning agent architecture is shown in Figure 1.
Qualitative State Representation
The raw input and output are represented using three types of variables: continuous motor variables, continuous input variables, and nominal input variables. Internally, the agent represents these variables qualitatively based on QSIM (Kuipers 1994). The agent creates two variables for each continuous input variable $\tilde{v}$: a discrete variable $v(t)$ that represents the magnitude of $\tilde{v}(t)$, and a discrete variable $\dot{v}(t)$ that represents the direction of change of $\tilde{v}(t)$. For each continuous motor variable $\tilde{v}$, the agent creates a discrete magnitude variable $v(t)$. As a result, the agent represents its world using four types of discrete (qualitative) variables: motor variables, magnitude variables, direction of change variables, and nominal variables.
We now describe how the agent converts continuous values to qualitative values. A continuous variable $\tilde{v}(t)$ ranges over some subset of the real number line $(-\infty, +\infty)$. In QSIM, this continuous variable $\tilde{v}(t)$ is abstracted to a discrete variable $v(t)$ that ranges over a quantity space $Q(v)$ of qualitative values. $Q(v) = L(v) \cup I(v)$, where $L(v) = \{v^*_1, \ldots, v^*_n\}$ is a totally ordered set of landmark values, and $I(v) = \{(-\infty, v^*_1), (v^*_1, v^*_2), \ldots, (v^*_n, +\infty)\}$ is the set of mutually disjoint open intervals that $L(v)$ defines in the real number line. A quantity space with two landmarks might be described by $(v^*_1, v^*_2)$, which implies five distinct qualitative values, $Q(v) = \{(-\infty, v^*_1), v^*_1, (v^*_1, v^*_2), v^*_2, (v^*_2, +\infty)\}$.
Each direction of change variable $\dot{v}$ has a single intrinsic landmark at 0, so its quantity space is $Q(\dot{v}) = \{(-\infty, 0), 0, (0, +\infty)\}$. Magnitude variables initially have no landmarks because zero is just another point on the number line, although landmarks are acquired as the agent learns. Initially, when the agent knows of no meaningful qualitative distinctions among values for $\tilde{v}(t)$, we describe the quantity space as the empty list of landmarks, (). Motor variables are given an initial landmark at 0, and like magnitude variables, they can acquire more landmarks as the agent learns.
As an implementation note, because we evaluate the algorithm with a fine-grained discrete-timestep simulator, if $v^*_1$ is a landmark and $\tilde{v}(t-1) < v^*_1$ and $\tilde{v}(t) > v^*_1$, then $v(t) = v^*_1$ for the purpose of rule learning.
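The mapping from a real value to a qualitative value in $Q(v)$ can be illustrated with a minimal sketch. The function names and the tuple encoding of qualitative values are assumptions made for illustration and do not come from the paper. The second function mirrors the implementation note above and, like that note, only handles the upward crossing; if several landmarks are crossed in one timestep it reports the first, which is also an assumption.

```python
from bisect import bisect_left


def qualitative_value(x, landmarks):
    """Map a real value x to a qualitative value in Q(v).

    landmarks is a sorted list [v1*, ..., vn*]. The result is
    ("landmark", i) when x equals landmark i, or ("interval", i) for the
    open interval just below landmark i. Interval 0 is (-inf, v1*) and
    interval n is (vn*, +inf). This encoding is hypothetical.
    """
    i = bisect_left(landmarks, x)
    if i < len(landmarks) and landmarks[i] == x:
        return ("landmark", i)
    return ("interval", i)


def qualitative_value_with_crossing(prev_x, x, landmarks):
    """Discrete-timestep variant: report a landmark that was stepped over.

    If prev_x < v* < x for some landmark v*, the value is taken to be that
    landmark, so the event of reaching v* is not missed between timesteps.
    """
    for i, lm in enumerate(landmarks):
        if prev_x < lm < x:
            return ("landmark", i)
    return qualitative_value(x, landmarks)
```

With two landmarks, `qualitative_value` returns one of five distinct values, matching the five-element quantity space described above.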
Events
If $a$ is a qualitative value of a discrete variable $A$, meaning $a \in Q(A)$, then the event $A_t \to a$ is defined by $A(t-1) \neq a$ and $A(t) = a$. That is, an event takes place when a discrete variable $A$ changes to value $a$ at time $t$, from some other value. We will often drop the $t$ and describe this simply as $A \to a$. We will also refer to an event as $E$ when the variable and qualitative value involved are not important.
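The event definition can be read directly as a scan over a sequence of qualitative values. The following is a small sketch; the function name and the representation of a trace as a Python list are assumptions for illustration.

```python
def events(trace):
    """Yield (t, a) for every event A_t -> a in a list of qualitative values.

    trace[t] is the qualitative value A(t). An event occurs at time t when
    A(t - 1) != a and A(t) == a.
    """
    for t in range(1, len(trace)):
        if trace[t - 1] != trace[t]:
            yield (t, trace[t])
```

For example, a direction of change variable that goes from negative to zero to positive produces two events, one for each change of qualitative value.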
Predictive Rules
We break from previous work (Mugan & Kuipers 2007a; 2007b) and take inspiration from (Pearl 2000) to define predictive rules based on actions of the agent. A predictive rule $r$ has the form $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ and states that if the agent executes a plan to bring about event $E_1$, then event $E_2$ will soon follow. The probability that $E_2$ will indeed soon follow $E_1$ is given in the context $C$. For an event $E$ we define $do_t(E)$ as a predicate that is true when the agent begins executing a plan at time $t$ to bring about $E$ and is false otherwise.
The predictive rule $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ consists of one event $E_1(t)$ and another event $E_2(t')$ that takes place relatively soon after $t$. That $E_2$ takes place “relatively soon after” $E_1(t)$ is formalized in terms of an integer time-delay $k$ (in our current implementation $k = 5$, or about 0.25 seconds):
$$soon(t, E_2) \equiv \exists t' \, [t \le t' \le t + k \wedge E_2(t')]$$
If we define $do_{r,t}(E(t'))$ to mean that $do_t(E) = true$ and that $E$ occurs at time $t'$, then we can define a predicate $succeeds(r, t)$ for the success of rule $r$ as
$$succeeds(r, t) \equiv do_{r,t}(E_1(t')) \wedge soon(t', E_2)$$
This means that rule $r$ fails if the agent’s plan to bring about $E_1$ fails, or if $E_2$ does not soon follow. We shorten $succeeds(r, t)$ to the predicate $succeeds(r)$, which is true if $r$ succeeds when activated at an arbitrary time.
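As a rough illustration of the two predicates, the sketch below checks one rule activation given the times at which events were observed. Representing $E_2$ by a list of occurrence times, and a failed plan by `None`, are assumptions made here for illustration; the inclusive window bounds follow the definition of $soon$ as written above.

```python
K = 5  # time-delay in timesteps, the value quoted in the text


def soon(t, e2_times, k=K):
    """True if E2 occurs at some t' with t <= t' <= t + k."""
    return any(t <= tp <= t + k for tp in e2_times)


def succeeds(e1_time, e2_times, k=K):
    """Success of one activation of a rule <C : do(E1) => E2>.

    e1_time is the time at which the agent's plan brought about E1, or
    None if the plan failed (a hypothetical encoding). The rule succeeds
    only if E1 was achieved and E2 soon followed.
    """
    if e1_time is None:
        return False
    return soon(e1_time, e2_times, k)
```

The two ways a rule can fail in the text correspond to the two ways this function returns `False`: the plan for $E_1$ never completes, or no occurrence of $E_2$ falls inside the window.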
Associated with rule $r$ is a context $C$ that consists of a set of variables. The context induces a conditional probability table (CPT) on the predicate $succeeds(r)$. In Bayesian [...] $succeeds(r)$. For a rule $r = \langle C : do(A \to a) \Rightarrow B \to b \rangle$ we require that the elements of the context be magnitude or nominal variables, and for event $B \to b$ we require that $B$ be a nominal variable or that $B$ be a direction of change variable with $b \neq [0]$.
Learning New Predictive Rules
To learn a new predictive rule the agent searches for two events $E_1$ and $E_2$ such that observing event $E_1$ means that event $E_2$ is significantly more likely to occur than it would have been otherwise. When two such events are found the agent asserts an initial rule $\langle \emptyset : do(E_1) \Rightarrow E_2 \rangle$ with an empty context.
The set of rules grows out of the motor variables. To create a rule $\langle \emptyset : do(E_1) \Rightarrow E_2 \rangle$ we require that the agent be able to predict event $E_1$ using the currently existing rules.
Rule Context Greedy Search
The purpose of the context is to tell the agent when the rule will succeed, and the agent greedily searches for a good context for each rule. A good context $C$ for $r$ is one for which there is some value of the variables in $C$ for which the rule has high reliability. Once this is achieved, the agent desires that the context predict the outcome of the rule in all states.
We formalize the idea of having a value for which the rule is highly reliable using the notion of best reliability $brel(r)$. For a context $C = \{v_1, \ldots, v_n\}$, we define the product space of qualitative values:
$$Q(C) = Q(v_1) \times Q(v_2) \times \cdots \times Q(v_n). \qquad (1)$$
With sufficient observations, we then define best reliability as the maximum over this product space
$$brel(r) = \max_{w \in Q(C)} \Pr(succeeds(r) \mid w) \qquad (2)$$
which we can also write as $brel(r) = \max CPT(r)$.
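A minimal sketch of equation (2), assuming the CPT is estimated from per-cell success counts. The dictionary layout, the function name, and the `min_obs` cutoff are hypothetical; the cutoff is only a stand-in for the phrase "with sufficient observations", whose exact criterion is not given in the text.

```python
def brel(counts, min_obs=5):
    """Best reliability of a rule from per-cell counts.

    counts maps each w in Q(C) (a tuple of qualitative values) to a pair
    (successes, activations). Cells with fewer than min_obs activations
    are skipped; min_obs is an assumed placeholder value.
    """
    eligible = [s / n for s, n in counts.values() if n >= min_obs]
    return max(eligible, default=0.0)
```

In the example below, a cell observed only once does not determine the best reliability even though its empirical success rate is 1.0.

```python
counts = {
    ("left", "level"): (1, 10),
    ("right", "level"): (9, 10),
    ("right", "above"): (1, 1),
}
best = brel(counts)  # uses only the two cells with at least 5 activations
```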
Once $brel(r)$ exceeds the threshold $\theta_r = 0.7$ we determine that the rule is reliable. At this point the agent turns its attention to being able to predict the outcome of a rule in any situation. To do this it seeks to minimize the entropy. The entropy $H(Y)$ of a random variable $Y$ is given by
$$H(Y) = -\sum_j P(Y = y_j) \log_2 P(Y = y_j).$$
The conditional entropy $H(Y \mid X)$ of a random variable $Y$ given $X$ is given by
$$H(Y \mid X) = \sum_i H(Y \mid X = x_i) \, P(X = x_i)$$
and is the weighted average of the entropy of $Y$ given $X = x_i$, weighted by the probability $P(X = x_i)$. We define the entropy $H(r)$ of a rule $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ as the conditional entropy of $succeeds(r)$ given $do(E_1) = true$ and $C$. In equation form it is
$$H(r) = H(succeeds(r) \mid do(E_1) = true, C). \qquad (3)$$
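The rule entropy of equation (3) can be estimated from the same kind of per-cell counts, since every recorded activation is a time step at which $do(E_1)$ was true. The sketch below is one plausible way to compute it; the count layout and function names are assumptions for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Boolean variable that is true with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def rule_entropy(counts):
    """Estimate H(r) = H(succeeds(r) | do(E1) = true, C).

    counts maps each context value w to (successes, activations). Each cell
    contributes the entropy of its success rate, weighted by the fraction
    of activations that fell in that cell.
    """
    total = sum(n for _, n in counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * binary_entropy(s / n)
               for s, n in counts.values() if n > 0)
```

A context that separates successes from failures lowers this quantity: a rule that succeeds 15 times out of 20 with an empty context has entropy of about 0.81 bits, while splitting the same activations into a cell with 10 of 10 successes and a cell with 5 of 10 gives 0.5 bits.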
With these definitions we can now describe how the agent determines if one context is better than another. For each rule the agent hillclimbs on the quality of the context. For a rule $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ with $brel(r) < \theta_r$, we say that the rule $r' = \langle C' : do(E_1) \Rightarrow E_2 \rangle$ with improved context $C'$ is a sufficient improvement over $r$ if $brel(r') \gg brel(r)$. For a rule $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ with $brel(r) > \theta_r$, we say that the rule $r' = \langle C' : do(E_1) \Rightarrow E_2 \rangle$ with improved context $C'$ is a sufficient improvement over $r$ if $H(r') \ll H(r)$. Here the operators $\gg$ and $\ll$ mean sufficiently greater than and sufficiently less than, respectively.
Learning a Context for a Predictive Rule
The context for a predictive rule is learned incrementally. For each rule $r = \langle \emptyset : do(E_1) \Rightarrow E_2 \rangle$ with an empty context, the agent searches for a magnitude or nominal variable $v_1$ such that if $r$ is modified to be $r' = \langle \{v_1\} : do(E_1) \Rightarrow E_2 \rangle$ then $r'$ is a sufficient improvement.
Using an approach inspired by Drescher (1991), once the agent has learned a rule $r' = \langle \{v_1\} : do(E_1) \Rightarrow E_2 \rangle$ it searches for another discrete magnitude or nominal variable $v_2$ such that if $r'$ is modified to be $r'' = \langle \{v_1, v_2\} : do(E_1) \Rightarrow E_2 \rangle$ then $r''$ is a sufficient improvement. This criterion clearly generalizes, but in our current implementation we limit the size of the context to two.
Learning New Landmarks
Inserting a new landmark $x^*$ into $(x^*_i, x^*_{i+1})$ allows that interval to be replaced in $Q(x)$ by two intervals and the dividing landmark: $(x^*_i, x^*)$, $x^*$, $(x^*, x^*_{i+1})$. Adding this new landmark into the quantity space $Q(x)$ allows a new distinction to be made that may transform a rule $r$ into a new rule $r'$. A new landmark can be learned either by improving a predictive rule or by creating an event that reliably precedes another event.
Landmarks that Improve Rules. Landmark candidates are generated for a rule $r = \langle C : do(A \to a) \Rightarrow B \to b \rangle$ using the success or failure of $r$ as a reward signal. A landmark candidate for $r$ is adopted if it sufficiently improves $r$. A landmark can improve $r$ by refining the event $A \to a$ or by refining a variable in $C$.
To learn new landmarks it is not necessary to store the entire history. Instead, we only store the real values of all the variables for the last 200 activations of each rule. Landmark candidates are chosen considering the number of data points in the interval and the highest gain (Fayyad & Irani 1993). Depending on the distance from the new landmark $x^*$ to the maximum and minimum observed values of $x$, this search can result in either a precise numerical value, or a range of possible values for $x^*$ on different occasions: $range(x^*) = [lb, ub]$.
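The idea of choosing a cut point by gain can be sketched as a search over boundaries between sorted samples, scoring each by the reduction in entropy of the success labels. This is a simplified illustration only: it omits the stopping criterion of Fayyad and Irani's method, and the `min_points` parameter is a hypothetical stand-in for "considering the number of data points in the interval". All names are assumptions.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_cut(samples, min_points=2):
    """Return (cut, gain) for the best landmark candidate, or None.

    samples is a list of (value, succeeded) pairs recorded at rule
    activations. Each side of a cut must hold at least min_points samples.
    """
    data = sorted(samples)
    labels = [int(ok) for _, ok in data]
    base = entropy(labels)
    best = None
    for i in range(min_points, len(data) - min_points + 1):
        lo, hi = data[i - 1][0], data[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left, right = labels[:i], labels[i:]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(data)
        gain = base - remainder
        if best is None or gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best
```

When failures all lie below some value and successes all lie above it, the sketch places the candidate midway between the two groups with a gain of one bit.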
Landmarks Suggested by Events. A landmark $x^*$ is created for a variable $x$ if it is estimated that the event $x \to x^*$ will reliably predict some other event $E$. To find this landmark, for each event $E$ a histogram is maintained for each continuous variable $\tilde{x}$. Each time $E$ occurs the histogram is updated with the current value of $\tilde{x}$. One or more landmark candidates are created for $\tilde{x}$ when the distribution of $\tilde{x}$ when $E$ occurs differs from the overall distribution of $\tilde{x}$. The location of each landmark $x^*$ is taken to be the middle of a histogram bucket where the difference is the greatest.
Acting in the World
The Controller
The controller enables the agent to learn efficiently by actively choosing rules to test. In (Mugan & Kuipers 2007a), active learning was motivated by the desire to achieve certain goals; in this paper the motivation for active learning is improving the reliability of rules.
Choosing a Rule to Invoke. The controller chooses a rule to invoke based on its weight $w$. The weight of a rule $r$ consists of two components $w_1$ and $w_2$ and is their product, bounded below by a small constant: $w = \max(0.001, w_1 \cdot w_2)$. If we use the notation $\Pr(succeeds(r) \mid C, s)$ to mean the probability of success of $r$ in the current state $s$, then the component $w_1 = \Pr(succeeds(r) \mid C, s)$. The component $w_2$ reflects the rate at which the reliability of the rule is increasing, inspired by the “curiosity drive” of Oudeyer and Kaplan (2004).
Invoking a Rule. Once the rule $r = \langle C : do(A \to a) \Rightarrow B \to b \rangle$ has been chosen, the agent forms a plan to achieve $A \to a$. To do this, the controller examines the context $C$. We say that the context is satisfied if in the current state the context says the rule will be sufficiently reliable, where sufficiently reliable means that $\Pr(succeeds(r) \mid C) > \theta_{sr}$, where $\theta_{sr} = 0.60$. There are three cases:
1. The context is satisfied.
2. The context is not satisfied and consists of only one variable.
3. The context is not satisfied and consists of more than one variable.
If the context is satisfied, the agent sets the goal to be $do(A \to a)$. If the context is not satisfied but consists of only a single variable $V$, then the agent sets the goal to be $do(V \to v)$ where $v \in Q(V)$ has the highest $\Pr(succeeds(r) \mid V = v)$. If $do(V \to v)$ is successful, the agent then sets the goal to be $do(A \to a)$. If the context is not satisfied and consists of more than one variable, then the agent sets the goal to be any member of the set $Good(CPT(r))$ defined in equation (4) (if the set $Good(CPT(r))$ is empty then the context is ignored). Once this goal is achieved the agent sets the goal to be $do(A \to a)$.
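The three cases can be summarized in a short sketch that decides the controller's next goal from the CPT and the current context value. The dictionary representation of the CPT, the return encoding, treating unseen context values as probability 0, and picking a random member of the good set are all assumptions made for illustration.

```python
import random

THETA_SR = 0.60  # "sufficiently reliable" threshold quoted in the text


def good(cpt, theta_sr=THETA_SR):
    """Context values whose success probability exceeds theta_sr."""
    return [w for w, p in cpt.items() if p > theta_sr]


def next_goal(cpt, state, theta_sr=THETA_SR, rng=random):
    """Pick the next goal for an invoked rule.

    cpt maps tuples of context values to Pr(succeeds(r) | w), and state is
    the current tuple of context values. Returns ("event", None) when the
    agent should pursue do(A -> a) directly, or ("context", w) when it
    should first reach context value w.
    """
    if not cpt or cpt.get(state, 0.0) > theta_sr:
        return ("event", None)                      # case 1: satisfied
    if len(state) == 1:
        return ("context", max(cpt, key=cpt.get))   # case 2: best single value
    candidates = good(cpt, theta_sr)
    if not candidates:
        return ("event", None)                      # empty good set: ignore context
    return ("context", rng.choice(candidates))      # case 3: any good cell
```

In the multi-variable case the returned context value is what the reinforcement learning machinery described later would be asked to reach before the agent attempts $do(A \to a)$.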
Backchaining Actions
Goals of the form $do(Y \to y)$ are achieved through backchaining. The approach to achieve a goal $do(Y \to y)$ depends on the type of variable $Y$. (1) If $Y$ is a motor variable, then a random real value is picked from the range of the qualitative value $y$ and the action is complete. (2) If $Y$ is a direction of change or nominal variable, then the agent looks for a reliable rule of the form $r = \langle C : do(X \to x) \Rightarrow Y \to y \rangle$ that in the current state is predicted to succeed with reliability greater than $\theta_{sr}$ and invokes $do(X \to x)$. If no such rule is found or if $r$ fails, then backchaining fails. (3) If $Y$ is a magnitude variable then the agent uses a special rule of the form $h_1 = \langle do(\dot{Y} \to (0, +\infty))$ until $Y \to y \rangle$ if $\tilde{Y}(t) < y$, or of the form $h_2 = \langle do(\dot{Y} \to (-\infty, 0))$ until $Y \to y \rangle$ if $\tilde{Y}(t) > y$. Rule $h_1$ fails if $do(\dot{Y} \to (0, +\infty))$ is not achieved or if after $\dot{Y} \to (0, +\infty)$ an event occurs such that $\dot{Y} \neq (0, +\infty)$ before $Y \to y$. Rule $h_2$ works similarly. If during backchaining an event $E$ occurs more than once, or if events $v \to (-\infty, 0)$ and $v \to (0, +\infty)$ both occur for some variable $v$, then backchaining fails.
Once a motor variable is reached, its value is maintained until event $Y \to y$ occurs or one of the rules used in backchaining fails.
[Figure 2 diagram, panels (a) through (f). Recoverable labels: the rule $r = \langle \{c_x, c_y\} : do(d \to [0]) \Rightarrow \dot{b}_x \to (-\infty, 0) \rangle$; "r succeeds" and "r fails" over axes $c_x$ and $c_y$; $CPT(r)$; $Good(CPT(r))$; Q-table $Q : S \times A \to \mathbb{R}$; $\pi_r : C \to U$; $r' = \langle \emptyset : do(\pi_r) \Rightarrow C \to Good(CPT(r)) \rangle$.]
Figure 2: (a) The rule $r = \langle \{c_x, c_y\} : do(d \to [0]) \Rightarrow \dot{b}_x \to (-\infty, 0) \rangle$ is an example of a rule learned by the robot. It states that if the distance $d$ between the hand and the block goes to 0, then the event $\dot{b}_x \to (-\infty, 0)$ of the block moving to the left will occur. The predicted success of this rule depends on the context variables $c_x$ and $c_y$ that give the location of the hand in the frame of reference of the block. (b) The agent gathers experience in the world to learn the context values for which $r$ is successful. The agent learns that the hand must be to the right of and level with the block for $r$ to be successful. (c) Based on $C = \{c_x, c_y\}$ the agent creates a conditional probability table $CPT(r)$ for $r$ and uses a threshold to determine the set $Good(CPT(r))$ of values of $C$ for which the rule $r$ is likely to succeed. (d) The agent can then define a simple reinforcement learning problem in which $C$ defines the state space, and $Good(CPT(r))$ defines the goal states. The agent is rewarded for reaching a state in which the rule $r$ is likely to succeed. To do this, the agent creates a Q-table that maps $S \times A$ to a value in $\mathbb{R}$, where $A$ is the set of primitive actions (defined by the qualitative values of the motor variables $u_x$ and $u_y$). (e) The agent then defines a policy $\pi_r$ by associating each cell in $C$ with the primitive action with maximum value. (f) The policy $\pi_r$ can then be described by a new rule $r'$ that treats $\pi_r$ as an action leading to the region $Good(CPT(r))$ where $r$ is likely to succeed.
Reinforcement Learning Actions
The agent uses reinforcement learning to achieve goals over multiple variables. For each rule $r = \langle C : do(E_1) \Rightarrow E_2 \rangle$ with a context of more than one variable, the agent creates a reinforcement learning problem to enable the agent to get into a state such that doing event $E_1$ will cause event $E_2$. An overview of this process is shown in Figure 2.
The type of reinforcement learning we use is Sarsa($\lambda$) (Sutton & Barto 1998) where $\lambda = 0.9$, $\alpha$ is one over the number of times the state has been visited, and the discount parameter $\gamma = 0.9$. To learn the policy $\pi_r$ the agent learns an action-value function $Q : S \times A \to \mathbb{R}$.
The state space $S$ is defined by the qualitative variables in $C$ and their landmarks. To define the set of primitive actions $A$ we first define a set $Q(U) = Q(u_1) \times \cdots \times Q(u_n)$ where $u_1, \ldots, u_n$ is the set of motor variables. We can then define a primitive action $a \in A$ as choosing a $w \in Q(U)$, taking random values from the ranges of the qualitative values in $w$, and maintaining those values until the state $S$ changes or the real values underlying the variables that make up $S$ stop changing.
For the reward function we use a goal-reward representation (Koenig & Simmons 1996). The reward is based on the set of goal states $Good(CPT(r))$, which is determined by the conditional probability table $CPT(r)$:
$$Good(CPT(r)) = \{w \in Q(C) \mid \Pr(succeeds(r) \mid w) > \theta_{sr}\} \qquad (4)$$
The agent then learns the Q-table by using $\epsilon$-greedy action selection where $\epsilon = 0.25$. An episode begins when the rule $r$ is invoked by the controller, and the episode ends when the agent makes it to a goal state or when 20 primitive actions have been taken.
Once the Q-table is learned, a policy $\pi_r$ can be created whereby the agent chooses the best primitive action for each state. In principle, this leads to a new rule of the form $r' = \langle \emptyset : do(\pi_r) \Rightarrow C \to Good(CPT(r)) \rangle$.
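A single tabular Sarsa($\lambda$) update with the parameters quoted above might look like the following sketch. Several details are assumptions rather than statements from the paper: the use of accumulating eligibility traces, the concrete reward values (1 on entering a goal state, 0 otherwise), the per-state visit counter used for the step size, and all function and variable names.

```python
from collections import defaultdict

LAMBDA = 0.9  # trace decay, from the text
GAMMA = 0.9   # discount parameter, from the text


def goal_reward(next_state, goal_states):
    """Goal-reward signal: 1 on reaching a goal state, else 0 (assumed values)."""
    return 1.0 if next_state in goal_states else 0.0


def sarsa_lambda_step(q, trace, visits, s, a, reward, s2, a2, done):
    """Apply one Sarsa(lambda) update in place and return the TD error.

    q and trace are defaultdict(float) keyed by (state, action); visits is
    a defaultdict(int) keyed by state. The step size is one over the number
    of visits to s, as described in the text.
    """
    visits[s] += 1
    alpha = 1.0 / visits[s]
    target = reward if done else reward + GAMMA * q[(s2, a2)]
    delta = target - q[(s, a)]
    trace[(s, a)] += 1.0  # accumulating trace (an assumption)
    for key in list(trace):
        q[key] += alpha * delta * trace[key]
        trace[key] *= GAMMA * LAMBDA
    return delta
```

A two-step episode shows how the trace passes credit backward: after moving from a hypothetical state `"far"` to `"near"` and then into a goal state, the earlier state-action pair receives a share of the goal reward scaled by $\gamma \lambda$.

```python
q, trace, visits = defaultdict(float), defaultdict(float), defaultdict(int)
goals = {"goal"}
sarsa_lambda_step(q, trace, visits, "far", "right",
                  goal_reward("near", goals), "near", "right", done=False)
sarsa_lambda_step(q, trace, visits, "near", "right",
                  goal_reward("goal", goals), "goal", None, done=True)
```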
Experimental Evaluation
We evaluate our algorithm using the simulated agent shown in Figure 3. The evaluation task we have chosen is for the agent to hit the block in a specified direction. To show that our representation can effectively be used for reinforcement learning, we compare our method to a hand-created reinforcement learning agent trained specifically for this task. For this evaluation we trained ten agents total: five autonomous agents as described in this paper, and five hand-created learning agents.
Figure 3: A simulated “robot baby” is implemented in Breve (Klein 2003). It has a torso with a 2-dof orthogonal arm and is sitting in front of a tray with a block. The robot has two motor variables $\tilde{u}_x$ and $\tilde{u}_y$ that move the hand in the x and y directions, respectively. The perceptual system creates variables for each of the two tracked objects in this environment: the hand and the block. The hand is described by two continuous variables $\tilde{h}_x(t)$, $\tilde{h}_y(t)$ that represent the location of the hand in the x and y directions, respectively, and the Boolean variable $h_a(t)$ that represents whether the hand is in view. The variables corresponding to the block are $\tilde{b}_x(t)$, $\tilde{b}_y(t)$, and $b_a(t)$, and they have the same respective meanings as the variables for the hand. The relationship between the hand and the block is represented by the continuous variables $\tilde{c}_x(t)$, $\tilde{c}_y(t)$, and $\tilde{d}(t)$. The variables $\tilde{c}_x(t)$ and $\tilde{c}_y(t)$ represent the coordinates of the center of the hand in the frame of reference of the center of the block, and the variable $\tilde{d}(t)$ represents the distance between the hand and the block. The values of all variables are updated by perceptual trackers at each timestep as the objects move.
We trained each agent in the environment shown in Figure 3 for 340,000 timesteps (almost five hours of physical experience). During this time, the hand-created agents continually repeated episodes of the task, and the autonomous agents performed the learning algorithm described in this paper. During training of the autonomous agents, if the block fell off the tray, moved out of reach of the agent, or was not moved for an extended time, the block was moved to a random location within reach of the agent. For all agents we stored the state of each agent’s knowledge every 20,000 timesteps during training (corresponding to about sixteen minutes of physical experience). We then ran the evaluation for each agent using their respective stored knowledge bases.
Each evaluation consisted of 100 trials. At the beginning of each trial the block was placed in a random location within reach of the agent and the evaluator picked one of three goals: hitting the block to the left, hitting the block to the right, or hitting the block forward. The agent then had 300 timesteps to use its knowledge to hit the block in the correct direction. A trial was terminated unsuccessfully if the agent hit the block in the wrong direction. The evalua-
Artificial Intelligence, volume 2, 1022–1027.
Klein, J. 2003. Breve: a 3d environment for the simulation of decentralized systems and artificial life. In Proceedings of the International Conference on Artificial Life, 329–334.
Koenig, S., and Simmons, R. 1996. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning 22(1):227–250.
Kuipers, B. 1994. Qualitative Reasoning. Cambridge, Massachusetts: The MIT Press.
McGovern, A., and Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the International Conference on Machine Learning, 361–368.
Mugan, J., and Kuipers, B. 2007a. Learning distinctions and rules in a continuous world through active exploration. In Proceedings of the International Conference on Epigenetic Robotics.
Mugan, J., and Kuipers, B. 2007b. Learning to predict the effects of actions: Synergy between rules and landmarks. In Proceedings of the International Conference on Development and Learning.
Oudeyer, P.-Y., and Kaplan, F. 2004. Intelligent adaptive curiosity. In Proceedings of the International Conference on Epigenetic Robotics.
Pearl, J. 2000. Causality: Modeling, Reasoning, and Inference. Cambridge: Cambridge University Press.
Sachenbacher, M., and Struss, P. 2005. Task-dependent qualitative domain abstraction. Artificial Intelligence 162(1-2):121–143.
Santamaria, J.; Sutton, R.; and Ram, A. 1997. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces. Adaptive Behavior 6(2):163.
Simsek, O., and Barto, A. 2004. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 751–758.
Simsek, O.; Wolfe, A.; and Barto, A. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the Twenty-Second International Conference on Machine Learning, 816–823.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. Cambridge MA: MIT Press.