Sequential Decision Making Based on Direct Search
Abstract
The most challenging open issues in sequential decision making include partial observability of the decision maker's environment, hierarchical and other types of credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in case of stochastic environments, stochastic policies, and unknown temporal delays between actions and observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models combining certain aspects of DS and traditional RL.
Jürgen Schmidhuber
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland
1 Introduction
Policy learning. A learner’s modifiable parameters that determine its behavior
are called its policy. An algorithm that modifies the policy is called a
learning algorithm. In the context of sequential decision making based on
reinforcement learning (RL) there are two broad classes of learning algorithms: (1)
methods based on dynamic programming (DP) (Bellman, 1961), and (2) direct
search (DS) in policy space. DP-based RL (DPRL) learns a value function
mapping input/action pairs to expected discounted future reward and uses online
variants of DP for constructing rewarding policies (Samuel, 1959; Barto,
Sutton, & Anderson, 1983; Sutton, 1988; Watkins, 1989; Watkins & Dayan, 1992;
Moore & Atkeson, 1993; Bertsekas & Tsitsiklis, 1996). DS runs and evaluates
policies directly, possibly building new policy candidates from those with the
highest evaluations observed so far. DS methods include variants of stochastic
hill-climbing (SHC), evolutionary strategies (Rechenberg, 1971; Schwefel, 1974),
genetic algorithms (GAs) (Holland, 1975), genetic programming (GP) (Cramer,
1985; Banzhaf, Nordin, Keller, & Francone, 1998), Levin Search (Levin, 1973,
1984), and adaptive extensions of Levin Search (Solomonoff, 1986; Schmidhuber,
Zhao, & Wiering, 1997b).
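The DS loop described above (run a policy, evaluate it, build the next candidate from the best one seen so far) can be illustrated with a minimal stochastic hill-climbing sketch. Everything in it is an assumption made for illustration: the policy is a plain list of real-valued parameters, and `evaluate` is a hypothetical stand-in for "run the policy in the environment and return its reward", here replaced by a toy deterministic objective.

```python
import random

def evaluate(policy):
    """Hypothetical evaluation of a policy (assumed toy objective).

    In a real task this would run the policy in the environment and
    return the observed reward; here reward is highest at TARGET.
    """
    target = [0.5, -1.0, 2.0]
    return -sum((p - t) ** 2 for p, t in zip(policy, target))

def stochastic_hill_climbing(n_params=3, n_trials=2000, step=0.1, seed=0):
    """Keep the best policy seen so far; propose Gaussian perturbations of it."""
    rng = random.Random(seed)
    best = [0.0] * n_params
    best_score = evaluate(best)
    for _ in range(n_trials):
        # Build a new candidate from the current best policy.
        candidate = [p + rng.gauss(0.0, step) for p in best]
        score = evaluate(candidate)
        # Keep the candidate only if it was evaluated as strictly better.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    policy, score = stochastic_hill_climbing()
    print(policy, score)
```

Note that the loop never refers to states or value estimates; it only compares whole-policy evaluations. Because the toy objective is deterministic, a single evaluation per candidate is enough here, which would not hold in a stochastic environment.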
Outline. DS offers several advantages over DPRL, but also has some
drawbacks. I will list advantages first (section 2), then describe an illustrative task
unsolvable by DPRL but trivially solvable by DS (section 3), then mention a
few theoretical results concerning DS in general search spaces (section 4), then
point out a major problem of DS (section 5), and offer a remedy (section 6 and
section 7).
R. Sun and C.L. Giles (Eds.): Sequence Learning, LNAI 1828, pp. 213–240, 2000.
© Springer-Verlag Berlin Heidelberg 2000
2 Advantages of Direct Search
2.1 DS Advantage 1: No States
Finite-time convergence proofs for DPRL (Kearns & Singh, 1999) require (among
other things) that the environment can be quantized into a finite number of
discrete states, and that the topology describing possible transitions from one state
to the next, given a particular action, is known in advance. Even if the real world
were quantizable into a discrete state space, however, for all practical purposes
this space would be inaccessible and remain unknown. Current proofs do not cover
apparently minor deviations from the basic principle, such as the world-class
RL backgammon player (Tesauro, 1994), which uses a nonlinear function
approximator to deal with a large but finite number of discrete states and, for the
moment at least, seems a bit like a miracle without full theoretical foundation.
Prior knowledge about the topology of a network connecting discrete states is
also required by algorithms for partially observable Markov decision processes
(POMDPs), although they are more powerful than standard DPRL, e.g.,
(Kaelbling, Littman, & Cassandra, 1995; Littman, Cassandra, & Kaelbling, 1995). In
general, however, we do not know a priori how to quantize a given environment
into meaningful states.
DS, however, completely avoids the issues of value functions and state
identification: it simply tests policies and keeps those that work best.
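To make the dependence on discrete states concrete, the following is a minimal sketch of tabular Q-learning, one of the DPRL methods cited in the introduction (Watkins & Dayan, 1992). The five-state chain, its rewards, and all parameter values are assumptions chosen for illustration and are not taken from the paper. The point to notice is that the value table `q` is indexed by an enumeration of states that must be available before learning starts.

```python
import random

N_STATES = 5        # assumed toy chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic transitions of the toy chain; reward 1 on reaching the end."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Off-policy tabular Q-learning with a uniformly random behavior policy."""
    rng = random.Random(seed)
    # One table row per discrete state: the state set is fixed in advance.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            a = rng.choice(ACTIONS)
            s2, r, done = step(s, a)
            # One-step backup toward reward plus discounted best next value.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            if done:
                break
            s = s2
    return q

def greedy_policy(q):
    """Greedy action for every non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]

if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

If the environment offered no such enumeration, there would be nothing to index `q` by. A direct-search learner does not face this problem, since it only needs a way to run a policy and score the outcome.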
2.2 DS Advantage 2: No Markovian Restrictions
Convergence proofs for DPRL also require that the learner’s current input
conveys all the information about the current state (or at least about the optimal
next action). In the real world, however, the current sensory input typically tells
next to nothing about the “current state of the world,” if there is such a thing
at all. Typically, memory of previous events is required to disambiguate inputs.
For instance, as your eyes are sequentially scanning the visual scene dominated
by this text you continually decide which parts (or possibly compressed
descriptions thereof) deserve to be represented in short-term memory. And you have
presumably learned to do this, apparently by some unknown, sophisticated RL
method fundamentally different from DPRL.
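The need for memory to disambiguate inputs can be shown with a small, purely hypothetical two-step task (not from the paper): at step 0 the learner sees a cue, at step 1 it sees the same uninformative input regardless of the cue, and it is rewarded only if its step-1 action equals the cue. The sketch below enumerates every memory-free policy (a fixed map from current input to action) and compares the best of them with a policy that stores one bit.

```python
from itertools import product

CUES = (0, 1)
OBSERVATIONS = ("cue0", "cue1", "blank")  # assumed input alphabet

def memory_free_return(policy, cue):
    """Run a reactive policy (dict: observation -> action) for one episode."""
    _first_action = policy[f"cue{cue}"]  # step 0: unrewarded, nothing is stored
    final_action = policy["blank"]       # step 1: input no longer reveals the cue
    return 1.0 if final_action == cue else 0.0

def memory_return(cue):
    """Run a policy with one bit of short-term memory for one episode."""
    memory = cue           # step 0: store the cue
    final_action = memory  # step 1: act from memory, not from the input
    return 1.0 if final_action == cue else 0.0

def best_memory_free_score():
    """Best average reward over all 2**3 memory-free policies."""
    scores = []
    for actions in product((0, 1), repeat=len(OBSERVATIONS)):
        policy = dict(zip(OBSERVATIONS, actions))
        scores.append(sum(memory_free_return(policy, c) for c in CUES) / len(CUES))
    return max(scores)

def memory_score():
    """Average reward of the one-bit-memory policy over both cues."""
    return sum(memory_return(c) for c in CUES) / len(CUES)

if __name__ == "__main__":
    print(best_memory_free_score(), memory_score())
```

In this toy task every memory-free policy is right on exactly one of the two cues, because its action on the uninformative input is fixed. The one-bit-memory policy is right on both.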
Some DPRL variants such as Q(λ) are limited to a very special kind of
exponentially decaying short-term memory. Others simply ignore memory issues
by focusing on suboptimal, memory-free solutions to problems whose optimal
solutions do require some form of short-term memory (Jaakkola, Singh, &
Jordan, 1995). Still others can in principle find optimal solutions even in partially
observable environments (POEs) (Kaelbling et al., 1995; Littman et al., 1995),
but they (a) are practically limited to very small problems (Littman, 1996), and
(b) do require knowledge of a discrete state space model of the environment. To
various degrees, problem (b) also holds for certain hierarchical RL approaches to
memory-based input disambiguation (Ring, 1991, 1993, 1994; McCallum, 1996;
Wiering & Schmidhuber, 1998). Although no discrete models are necessary for


