Intrinsically Motivated Information Foraging
Learning (2010)
- ISBN: 9781424469000
- DOI: 10.1109/DEVLRN.2010.5578859
Available from ieeexplore.ieee.org
Abstract
We treat information gathering as a POMDP in which the goal is to maximize an accumulated intrinsic reward at each time step based on the negative entropy of the agent’s beliefs about the world state. We show that such information foraging agents can discover intelligent exploration policies that take into account the long-term effects of sensor and motor actions, and can automatically adapt to variations in sensor noise, different amounts of prior information, and limited memory conditions.
Ian Fasel, Andrew Wilt, Nassim Mafi, Clayton T. Morrison
The University of Arizona, Department of Computer Science, Tucson, AZ 85721-0077
{ianfasel,anwilt,nmafi,clayton}@cs.arizona.edu
I. INFORMATION FORAGING AS OPTIMAL CONTROL
Robots, like biological organisms, are typically
equipped with multiple sensors designed to acquire many
different types of information (e.g., visible light, audio,
touch). Each of these sensors may have different effective
ranges or fields of view, and accuracy may degrade as
a function of distance from the robot. Often it is not
possible to run all sensors at all locations simultaneously,
either because of limitations in processing power, or
because certain sensors are mutually exclusive (for
instance, tasting an object may obscure it from view). Thus
acquiring information about the world often requires
navigating around and carefully selecting different sensing
actions. We refer to this as “Information Foraging”.
In this paper, we explore the idea of information
foraging as a stochastic optimal control problem, where
the intrinsic goal is to reduce total uncertainty about
the world state. This approach was termed InfoMax
control by Movellan [1], and has previously been studied
for discovering social contingencies or fixation targets in
foveated cameras [2], [3]. We extend the idea of InfoMax
control to an agent that must decide, at each point in
time, whether it should select one out of a battery of
sensors to apply at its current location (and if so, which
one), or whether it should instead move to another location.
Through multiple experiments, we show that an
intrinsic uncertainty reduction reward makes it possible
to learn intelligent exploration policies that take into
account the long-term effects of sensor and motor
actions. Moreover, we find that these policies can
automatically adapt to variations in sensor noise,
different prior information, and limited memory conditions.
II. BACKGROUND
The problem faced by a robot choosing sensing actions
to gain knowledge about its environment is conceptually
similar to the problem faced by a doctor choosing
medical tests or a neuroscientist choosing experimental
stimuli to find input-output relationships in a population
of neurons. The problem of choosing a design of
experiments in order to maximize some overall utility has
been treated extensively in the statistics community [4],
[5], [6] under the terminology of Bayesian optimal
experimental design (OED), and the same basic idea has been
applied in many science and engineering domains such
as computer vision [7], machine learning [8], molecular
biology [9] and conceptual psychology [10].
Bayesian OED generally involves integrating over all
possible observations that could be obtained by
experiments in a design, then selecting the design with the
highest expected utility. While there are a variety of
appropriate choices for utility (see [6] for a thorough
review), a commonly used utility, suggested by Lindley
[4], is the expected gain in Shannon information.
Following this suggestion, MacKay [11] showed that,
equivalently, one could use the expected change in entropy
from one stage of experimentation to the next as a way
of adaptively modifying the design of experiments as
they unfold, an approach called Adaptive Design
Optimization [12].
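The selection rule described above (average over every possible observation, then pick the design with the highest expected gain in Shannon information) can be sketched for a small discrete problem. This is a minimal illustration and not the formulation of any cited work: the two-state prior, the design names, and the helper functions `posterior`, `expected_information_gain`, and `best_design` are all assumptions introduced here.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior(prior, likelihood, obs):
    """Bayes' rule, where likelihood[s][obs] = P(obs | state s) for one design."""
    joint = [prior[s] * likelihood[s][obs] for s in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint]

def expected_information_gain(prior, likelihood):
    """Expected entropy reduction, averaged over all possible observations."""
    n_obs = len(likelihood[0])
    gain = 0.0
    for obs in range(n_obs):
        # Marginal probability of seeing this observation under the prior.
        p_obs = sum(prior[s] * likelihood[s][obs] for s in range(len(prior)))
        if p_obs > 0:
            post = posterior(prior, likelihood, obs)
            gain += p_obs * (entropy(prior) - entropy(post))
    return gain

def best_design(prior, designs):
    """Return the name of the design with the highest expected information gain."""
    return max(designs, key=lambda name: expected_information_gain(prior, designs[name]))

# Hypothetical example: two equally likely world states, two candidate tests.
prior = [0.5, 0.5]
designs = {
    "accurate_test": [[0.9, 0.1], [0.1, 0.9]],
    "noisy_test": [[0.6, 0.4], [0.4, 0.6]],
}
chosen = best_design(prior, designs)
```

Under these assumed likelihoods the more reliable test is selected, because its posterior is on average further from the uniform prior.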
Recently, Movellan [1] extended this idea to
autonomous agents. In this framework, the agent maintains
a set of beliefs about the current state of the world, then
takes actions in order to reduce its uncertainty about the
world as quickly as possible. This can be specified
formally as a partially observable Markov decision process
(POMDP), in which at each time step the agent takes
an action, receives an observation, updates the belief
distribution, and receives an intrinsic reward proportional
to the negative entropy of the agent’s belief
distribution over possible world states. This has been termed
“InfoMax control”, because minimizing the conditional
entropy of the state beliefs given a sequence of actions
and observations is equivalent to maximizing the mutual
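The per-step cycle described in this paragraph (take an action, receive an observation, update the belief, receive a reward proportional to the negative entropy of the belief) can be sketched as follows. This is only an illustrative sketch under simplifying assumptions that are not taken from the paper: the world state is static, there is a single noisy binary sensor, and the names `update_belief`, `step`, and the sensor model values are hypothetical.

```python
import math
import random

def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief distribution."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

def update_belief(belief, sensor_model, obs):
    """Bayesian update, where sensor_model[s][obs] = P(obs | state s)."""
    unnorm = [b * sensor_model[s][obs] for s, b in enumerate(belief)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def step(belief, true_state, sensor_model, rng):
    """One time step: sense, update the belief, return the intrinsic reward."""
    probs = sensor_model[true_state]
    # Sample an observation from the (assumed) sensor noise model.
    obs = rng.choices(range(len(probs)), weights=probs)[0]
    new_belief = update_belief(belief, sensor_model, obs)
    reward = -entropy(new_belief)  # intrinsic reward: negative belief entropy
    return new_belief, reward

# Hypothetical setup: two world states, a sensor that is right 80% of the time.
rng = random.Random(0)
sensor = [[0.8, 0.2], [0.2, 0.8]]
belief = [0.5, 0.5]
total_reward = 0.0
for t in range(5):
    belief, reward = step(belief, true_state=0, sensor_model=sensor, rng=rng)
    total_reward += reward
```

Because the reward is never positive and reaches zero only when the belief is certain, an agent that accumulates this reward is pushed toward actions that sharpen its belief quickly. Choosing among several sensors or movement actions would require a policy on top of this loop, which is omitted here.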


