Spatiotemporal energy models for the perception of motion.

by E H Adelson, J R Bergen
Journal of the Optical Society of America A: Optics and Image Science (1985)

Abstract

A motion sequence may be represented as a single pattern in x-y-t space; a velocity of motion corresponds to a three-dimensional orientation in this space. Motion information can be extracted by a system that responds to the oriented spatiotemporal energy. We discuss a class of models for human motion mechanisms in which the first stage consists of linear filters that are oriented in space-time and tuned in spatial frequency. The outputs of quadrature pairs of such filters are squared and summed to give a measure of motion energy. These responses are then fed into an opponent stage. Energy models can be built from elements that are consistent with known physiology and psychophysics, and they permit a qualitative understanding of a variety of motion phenomena.

Available from www.ncbi.nlm.nih.gov

284 J. Opt. Soc. Am. A/Vol. 2, No. 2 February 1985 E. H. Adelson and J. R. Bergen
Edward H. Adelson and James R. Bergen
David Sarnoff Research Center, RCA, Princeton, New Jersey 08540
Received July 9, 1984; accepted October 12, 1984
1. INTRODUCTION
When we watch a movie, we see a sequence of images in which
objects appear at a sequence of positions. Although each
frame represents a frozen instant of time, the movie gives us
a convincing impression of motion. Somehow the visual
system interprets the succession of still images so as to arrive
at a perception of a continuously moving scene.
This phenomenon represents one form of apparent motion.
How is it that we see apparent motion? One possibility is that
our visual system matches up corresponding points in suc-
ceeding frames and calculates an inferred velocity based on
the distance traveled over the frame interval. Much research
on apparent motion has taken the establishment of this
correspondence to be the fundamental problem to be solved.¹⁻³
We argue that this correspondence problem can often be by-
passed altogether; we take up this argument after discussing
various approaches to the problem of motion analysis.
Figure 1a shows a vertical bar, which is presented at a sequence
of discrete positions at a sequence of discrete times.
In a typical feature-matching model, the visual system is said
to (1) find salient features in successive frames; (2) establish
a correspondence between them; (3) determine ∆x, the distance
traveled, and ∆t, the time between frames; and, finally,
(4) compute the velocity as ∆x/∆t. In this example, the
features to be matched might be the edges of the bar.
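The four steps can be illustrated with a minimal sketch, assuming a one-dimensional luminance profile per frame and taking the left edge of the bar as the salient feature. The names left_edge and feature_match_velocity, the frame values, and the frame interval are hypothetical choices made for illustration; they are not taken from any published model.

```python
def left_edge(frame):
    """Step (1): return the index of the first bright pixel, or None."""
    for index, value in enumerate(frame):
        if value > 0:
            return index
    return None


def feature_match_velocity(frame_a, frame_b, dt):
    """Steps (2)-(4): pair the two edges, measure dx, and return dx / dt."""
    xa, xb = left_edge(frame_a), left_edge(frame_b)
    if xa is None or xb is None:
        raise ValueError("no edge found in one of the frames")
    dx = xb - xa  # assumes the two edges are the same physical feature
    return dx / dt


# A two-pixel bar that jumps two pixels to the right between frames.
frame_t1 = [0, 0, 1, 1, 0, 0, 0, 0]
frame_t2 = [0, 0, 0, 0, 1, 1, 0, 0]
velocity = feature_match_velocity(frame_t1, frame_t2, dt=0.5)
print(velocity)  # 4.0 pixels per unit time
```

The sketch sidesteps the hard part of the scheme: with a single bar the correspondence in step (2) is trivial, whereas with many candidate features it is not.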
In a typical global matching model, the visual system would
perform a match over some large region of the image, in es-
sence performing a template match by sliding the image from
one frame to match the image optimally in the next frame.
Most cross-correlation models (see, e.g., Lappin and Bell⁴) are
examples of the global matching approach. Once again, ∆x
and ∆t can be determined, and the velocity can be inferred.
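A global match of this kind can be sketched as a sliding cross-correlation, assuming one-dimensional frames and an exhaustive search over integer shifts. The function best_shift, the sample values, and the search range are hypothetical; this is an illustration of the general idea, not the specific model of Lappin and Bell.

```python
def best_shift(frame_a, frame_b, max_shift):
    """Return the integer shift of frame_a that best matches frame_b.

    For each candidate shift s, sum frame_a[i] * frame_b[i + s] over the
    overlapping samples and keep the shift with the largest sum.
    """
    n = len(frame_a)
    best, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0.0
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                score += frame_a[i] * frame_b[j]
        if score > best_score:
            best, best_score = shift, score
    return best


# An arbitrary signed pattern, and the same pattern moved two pixels right.
frame_a = [0.2, -1.0, 0.7, 0.1, -0.4, 0.9, -0.3, 0.5, -0.8, 0.0]
frame_b = [0.0, 0.0] + frame_a[:-2]

dx = best_shift(frame_a, frame_b, max_shift=4)
dt = 1.0  # assumed frame interval
velocity = dx / dt
print(dx, velocity)  # 2 2.0
```

Because a single shift is returned for the whole frame, a matcher of this form reports one global velocity and has nothing to say about displays in which different regions move differently.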
Matching models are designed to make predictions about
stimuli presented as sequences of frames (e.g., movies). Not
all stimuli fall naturally into such a description. In an ordi-
nary television, for example, the electron beam illuminates
adjacent points in a rapid sequence, sweeping out the even
lines of the raster pattern on one field and then returning to
fill in the odd lines on the next field (two fields constitute a
frame). Should the matching be taken between frames or
between fields? For that matter, why should it not be taken
between the successively illuminated points themselves? (Note
that the motion of the raster itself, which is normally
invisible, will become visible if the raster is quite slow.)
Although the answer is not immediately obvious, it is clear
that we need to consider the well-known persistence of visual
responses (i.e., the temporal filtering imposed by early visual
mechanisms) in order to make sense of even the simplest
phenomena of apparent motion. The rapidly illuminated
points on a television screen are blended together in time,
effectively making all the lines of a frame (including both
fields) visually present at one time. One approach to motion
modeling, therefore, is to build in a temporal-filtering stage
that preprocesses the visual input before it is passed along to
the matching system. The resulting model treats the stimulus
in both a continuous and a discrete fashion. Filtering is a
continuous operation and leads to a continuously varying
output, whereas matching is discrete, taking place between
images sampled at two particular moments in time. Having
been forced to introduce filtering into the model, we would like
to make full use of its properties. In fact, filtering can be used
to extract the motion information itself, thus rendering the
discrete matching stage superfluous.
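The blending effect of temporal filtering can be illustrated with a minimal sketch, assuming a first-order low-pass filter (an exponentially decaying persistence) as a stand-in for early visual temporal filtering. The filter form, the constant alpha, and the names low_pass, even_line, and odd_line are assumptions made for illustration; the text does not specify the filter.

```python
def low_pass(samples, alpha=0.2):
    """First-order recursive low-pass filter: state moves a fraction
    alpha of the way toward each new input sample."""
    out, state = [], 0.0
    for sample in samples:
        state += alpha * (sample - state)
        out.append(state)
    return out


# Two raster lines lit on alternating fields: one on even time steps,
# the other on odd time steps. They are never lit simultaneously.
even_line = [1.0 if t % 2 == 0 else 0.0 for t in range(40)]
odd_line = [1.0 if t % 2 == 1 else 0.0 for t in range(40)]

even_seen = low_pass(even_line)
odd_seen = low_pass(odd_line)

# After the initial transient, both filtered signals stay near 0.5,
# so both lines are effectively present at every moment.
print(min(even_seen[20:]), max(even_seen[20:]))
print(min(odd_seen[20:]), max(odd_seen[20:]))
```

The raw inputs alternate between 0 and 1, whereas the filtered outputs fluctuate only slightly around a common level, which is the sense in which the two fields are blended into one visually present frame.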
There are other reasons for shying away from matching
models as they are commonly presented. They can usually
make predictions about simple stimuli such as a moving bar,
but they may run into trouble when presented with a sequence
such as is shown in Fig. 1b. Here, a sequence of vertical ran-
dom noise patterns is presented. When this sequence is
viewed, complex motions are seen, varying from point to
point in the image. Different velocities are seen at different
positions, and these velocities change rapidly. A feature-mat-
ching model has difficulty making predictions because of the
familiar problems: What constitutes a feature? What should be
matched to what? Most feature-based models are not well
enough defined to offer predictions about a stimulus such as
that of Fig. 1b. Yet motion is seen, and we would like to be-
lieve that this motion percept is generated by the same lawful
processes that generate the percept of the moving bar.
Can a global matching model, such as a cross-correlation
model, do better? Again, it is hard to know what such a model
will predict. Most global matching models have been formu-
lated only to deal with the visibility of single global motions
and thus cannot be easily applied to the situation in which
many motions are seen at different points in the field.
0740-3232/85/020284-16$02.00 © 1985 Optical Society of America
Fig. 1. a, A sequence of images presented at times t₁, t₂, and t₃ showing a bar moving to the right. b, A sequence of vertical random noise patterns,
also shown at three successive instants of time. Motion is seen in each case. The motion percept is simple in a and complex in b, but a motion
model should be able to handle both cases.
A number of approaches have recently been developed that
can be used with complex inputs such as the dynamic noise
of Fig. 1b. Marr and Ullman⁵ describe a method for extracting
the motion of zero crossings in the outputs of linear
filters by comparing the sign of the filter output to the sign of
its temporal derivative at the zero crossing. A rather different
approach has been described by van Santen and Sperling⁶ in
an elaboration of Reichardt’s⁷ model in which a local correlation
(i.e., multiplication) is performed across space and time.
In van Santen and Sperling’s model, filters tuned for spatial
frequency serve as the inputs to the correlator stages. Van
Santen and Sperling provide a formal analysis of the model’s
properties, describe a set of linking assumptions, and show
that the model makes correct predictions about a large variety
of simple motion displays. A third approach has been described
by Watson and Ahumada⁸: Motion information is
extracted with simple linear filters without a multiplicative
stage; the filters are tuned for spatial and temporal frequency
as well as velocity, and directional selectivity is achieved by
setting up the appropriate phase relationships between an
underlying pair of filters. It is notable that this approach
achieves directional selectivity without any nonlinearities
(although some sort of nonlinearity must, of course, be present
at some point for motion detection to occur). Ross and Burr⁹
have also proposed that the visual system extracts motion
information with directionally tuned linear filters. Morgan¹⁰
has applied linear-filtering concepts to stroboscopic displays,
and Adelson¹¹ has discussed how a number of motion illusions
can be understood in terms of mechanisms that respond to the
motion energy within particular spatiotemporal-frequency
bands.
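The local correlation (multiplication) across space and time used in Reichardt-style models can be sketched as follows, assuming a discrete stimulus indexed as stimulus[t][x], a one-frame delay, and a one-pixel spacing between the two inputs. The sketch uses raw pixel values as inputs and omits the spatial-frequency-tuned input filters of van Santen and Sperling's elaboration; the names reichardt_output and moving_dot are hypothetical.

```python
def reichardt_output(stimulus, x, dx=1, delay=1):
    """Opponent correlator at position x.

    Multiplies the delayed signal at one location by the current signal
    at the neighboring location, and subtracts the mirror-image product.
    Positive totals indicate rightward motion, negative totals leftward.
    """
    total = 0.0
    for t in range(delay, len(stimulus)):
        rightward = stimulus[t - delay][x] * stimulus[t][x + dx]
        leftward = stimulus[t - delay][x + dx] * stimulus[t][x]
        total += rightward - leftward
    return total


def moving_dot(start, step, n_frames=6, width=8):
    """One bright pixel that moves `step` pixels per frame."""
    return [[1.0 if x == start + step * t else 0.0 for x in range(width)]
            for t in range(n_frames)]


right = reichardt_output(moving_dot(start=0, step=+1), x=2)
left = reichardt_output(moving_dot(start=5, step=-1), x=2)
print(right, left)  # 1.0 -1.0
```

The sign of the output carries the direction, and the multiplication is the only nonlinearity in the sketch.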
Although it is not immediately apparent, there are signifi-
cant formal connections between the linear-filtering approach
and the correlational approach of a Reichardt-style model, as
has been previously noted.⁶,¹² The topic is taken up in Appendix A;
at this point, we simply comment that both types
of model can be considered to respond to motion energy within
a given spatiotemporal-frequency band (a property that will
be discussed at greater length below).
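The notion of motion energy within a spatiotemporal-frequency band can be made concrete with a minimal sketch of the scheme summarized in the abstract: the outputs of a quadrature pair of space-time oriented linear filters are squared and summed, and the energies for opposite directions are then subtracted in an opponent stage. The sketch assumes Gabor-shaped filters (a Gaussian envelope times a drifting sinusoid) in one spatial dimension plus time; the filter shape, the constants FX, FT, SIGMA, and HALF, and all function names are assumptions made for illustration and are not the filters specified in the paper.

```python
import math

FX = 0.125    # assumed spatial frequency, cycles per pixel
FT = 0.125    # assumed temporal frequency, cycles per frame
SIGMA = 3.0   # assumed width of the Gaussian envelope
HALF = 8      # the (x, t) grid runs from -HALF to +HALF on both axes


def grating(direction, phase=0.0):
    """Drifting sinusoid on the (x, t) grid; direction=+1 is rightward."""
    return {(x, t): math.cos(2 * math.pi * (FX * x - direction * FT * t) + phase)
            for x in range(-HALF, HALF + 1)
            for t in range(-HALF, HALF + 1)}


def oriented_energy(stimulus, direction):
    """Square and sum the responses of an even/odd (quadrature) filter pair
    oriented in space-time for the given direction."""
    even = odd = 0.0
    for (x, t), value in stimulus.items():
        envelope = math.exp(-(x * x + t * t) / (2 * SIGMA ** 2))
        angle = 2 * math.pi * (FX * x - direction * FT * t)
        even += envelope * math.cos(angle) * value
        odd += envelope * math.sin(angle) * value
    return even ** 2 + odd ** 2


def opponent_energy(stimulus):
    """Rightward energy minus leftward energy."""
    return oriented_energy(stimulus, +1) - oriented_energy(stimulus, -1)


print(opponent_energy(grating(+1)) > 0)             # rightward grating
print(opponent_energy(grating(-1)) < 0)             # leftward grating
print(opponent_energy(grating(+1, phase=1.0)) > 0)  # sign holds at another phase
```

Squaring and summing the quadrature pair makes the energy largely insensitive to the phase of the stimulus, so the sign of the opponent output reflects direction rather than the momentary position of the grating.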
Our interest in this paper is not so much to discuss a par-
ticular model as to discuss a general class of models and not
so much to discuss this class as to discuss a general approach
to the problem of motion detection. We will consider models
closely related to the ones just mentioned: models that are
based on a simple low-level analysis of visual information,
starting with the outputs of linear filters. This kind of pro-
cessing is well understood and can be readily applied to any
stimulus input. Moreover, it is just the kind of processing that
is considered to occur early in the visual pathway, based on a
large variety of psychophysical and physiological experiments.¹³⁻¹⁶
Low-Level Processing in Motion Perception
A low-level approach seems particularly appropriate when one
is dealing with motion phenomena that occur with a rapid
sequence of presentations. Many investigators have found that
these rapid presentations lead to motion percepts that are
determined by rather simple low-level properties of the
stimuli.
Braddick¹⁷ provided evidence for two distinct kinds of
motion mechanisms in apparent motion. He called them
long-range and short-range mechanisms. The short-range
process operates over rather short spatial distances and short
time intervals and involves low-level kinds of visual informa-
