Voting Experts: An Unsupervised Algorithm for Segmenting Sequences

by Paul Cohen, Niall Adams, Brent Heeringa
Intelligent Data Analysis (2007)
  • ISSN: 1088467X

Abstract

We describe a statistical signature of chunks and an algorithm for finding chunks. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy or unpredictability and high entropy at their boundaries. We show that the log frequency of a chunk is a measure of its internal entropy. The Voting Experts algorithm exploits the signature of chunks to find word boundaries in text from four languages and episode boundaries in the activities of a mobile robot.

Available from iospress.metapress.com
Paul Cohen (1), Niall Adams (2), Brent Heeringa (3)
(1) USC Information Sciences Institute
(2) Department of Mathematics, Imperial College London, UK
(3) Department of Computer Science, University of Massachusetts
July 15, 2006
1 Introduction
“I have fallen into the custom of distinguishing between bits of information and
chunks of information. ... The span of immediate memory seems to be almost independent
of the number of bits per chunk, at least over the range that has been examined
to date.
The contrast of the terms bit and chunk also serves to highlight the fact that we
are not very definite about what constitutes a chunk of information. For example, the
memory span of five words ... might just as appropriately have been called a memory
span of 15 phonemes, since each word had about three phonemes in it. Intuitively, it is
clear that the subjects were recalling five words, not 15 phonemes, but the logical
distinction is not immediately apparent. We are dealing here with a process of organizing
or grouping the input into familiar units or chunks, and a great deal of learning has gone
into the formation of these familiar units.”
-- George Miller, “The Magical Number Seven, Plus or Minus Two: Some Limits on Our
Capacity for Processing Information” [25]
So began the story of chunking, one of the most useful and least understood phenomena
in human cognition. Although chunks are “what short-term memory can hold five of,” they
appear to be incommensurate in most other respects. Miller himself was perplexed because
the information content of chunks is so different. A telephone number, which may be two
or three chunks long, is very different from a chessboard, which may also contain just a few
chunks but is vastly more complex. Chunks contain other chunks, further obscuring their
information content. The psychological literature describes chunking in many experimental
situations (mostly having to do with long-term memory) but it says nothing about the
intrinsic, mathematical properties of chunks. The cognitive science literature discusses
algorithms for forming chunks, but, again, there is little said about what chunks have in
common.
Miller was close to the mark when he compared bits with chunks. Chunks may be identified
by an information-theoretic signature. Although chunks may contain vastly different
amounts of Shannon information, they have one thing in common: entropy within a chunk
is relatively low, whereas entropy at chunk boundaries is relatively high.
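The signature can be illustrated on a toy character sequence. The sketch below is only an illustration under stated assumptions, not the Voting Experts algorithm itself: the function names and the toy text are hypothetical, internal entropy is taken to be the negative log of a chunk's relative frequency among n-grams of the same length (following the abstract's remark that log frequency measures internal entropy), and boundary entropy is taken to be the Shannon entropy of the symbol that follows the chunk.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count every length-n substring of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def internal_entropy(text, chunk):
    """Assumed definition: -log2 of the chunk's relative frequency
    among all n-grams of the same length. The chunk must occur in text."""
    counts = ngram_counts(text, len(chunk))
    total = sum(counts.values())
    return -math.log2(counts[chunk] / total)

def boundary_entropy(text, chunk):
    """Assumed definition: Shannon entropy (in bits) of the distribution
    over the single symbol that immediately follows the chunk."""
    n = len(chunk)
    followers = Counter(
        text[i + n] for i in range(len(text) - n) if text[i:i + n] == chunk
    )
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

# Hypothetical toy corpus: four "words", each preceded by "the", spaces removed.
text = "thecatthedogthecowthepig"

# "th" is always followed by "e", so the next symbol is fully predictable.
print(boundary_entropy(text, "th"))
# After "the" the next symbol varies (c, d, c, p), so uncertainty rises.
print(boundary_entropy(text, "the"))
# The frequent chunk "the" has lower internal entropy than "eca",
# which straddles a word boundary and occurs only once.
print(internal_entropy(text, "the"), internal_entropy(text, "eca"))
```

In this toy example the uncertainty about the next symbol is zero inside "the" and jumps at its right edge, which is the "low within, high at the boundary" pattern described above.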
One sees many hints of this signature in the literature on segmentation. Many algorithms
for segmenting speech and text rely on some kind of information-theoretic indicator
of segment boundaries (e.g., [2, 21, 23, 33, 34, 5]). Edge-detectors in computer vision may
be described in similar terms (e.g., [29]). Intriguingly, while common compression algorithms
rely on a version of the “low entropy within a chunk” rule, the prediction by partial
match (PPM) method, which is apparently superior to other compression algorithms, also
attends to the entropy between chunks. One wonders whether the human perceptual and
cognitive systems have evolved to detect this “low entropy within, high entropy between”
signature. Saffran, Aslin and Newport [28] found something like it in their experiments
with speech segmentation by very young infants, and Hauser and his colleagues suggest the
same mechanism is at work in cotton-top tamarin monkeys [17, 11].
More comically, have you ever wondered why young children freeze in doorways and at
the bottom of escalators, or why some people in crowded airline terminals walk out of the
jetway and stop, causing a traffic jam behind them? These are examples of transitions from
familiar to unfamiliar surroundings, of low to high local entropy. Some people can’t handle
it.
This paper introduces an unsupervised algorithm called Voting Experts for finding
chunks in sequences. As such, it has applications to segmentation, a task it performs very
well. Most of the empirical evaluations of Voting Experts are on segmentation tasks,
because these tasks provide opportunities to compare our algorithm with others. However,
segmentation is not our primary interest. If it were, we would focus on supervised
segmentation (in which algorithms are trained with many examples of correct segmentation)
because it performs better than unsupervised methods. Even among unsupervised segmenters, some
may perform better than Voting Experts, and we may be sure that somewhere in the
sophisticated arsenal of probabilistic algorithms trained on corpora of millions of words are
some that will beat our simple method. So segmentation is not our focus. Rather, our point
is that remarkably accurate segmentation may be accomplished by a simple algorithm with
very modest data requirements because we have identified a signature of chunks (low
entropy within, high entropy between) and designed the algorithm to look for it.
This paper is organized as follows: Section 2 introduces the statistical signature of
chunks. Section 3 describes how Voting Experts works, and Sections 4, 4.3, and 4.4
demonstrate how well it works, including comparisons with the Sequitur segmentation
algorithm, tests on corpora from several languages, and tests with time series data from a
mobile robot. Section 5 describes some related work in the cognitive sciences and computer
science.