Unsupervised Segmentation of Audio Speech Using the Voting Experts Algorithm
Proceedings of the 2nd Conference on Artificial General Intelligence 2009 (2009)
- ISBN: 9789078677246
- DOI: 10.2991/agi.2009.25
Available from www.atlantis-press.com
Matthew Miller Peter Wong Alexander Stoytchev
Developmental Robotics Lab
Iowa State University
mamille@iastate.edu, pwwong@iastate.edu, alexs@iastate.edu
Abstract
Human beings have an apparently innate ability to seg-
ment continuous audio speech into words, and that abil-
ity is present in infants as young as 8 months old. This
propensity towards audio segmentation seems to lay the
groundwork for language learning. To artificially repro-
duce this ability would be both practically useful and
theoretically enlightening. In this paper we propose an
algorithm for the unsupervised segmentation of audio
speech, based on the Voting Experts (VE) algorithm,
which was originally designed to segment sequences of
discrete tokens into categorical episodes. We demon-
strate that our procedure is capable of inducing breaks
with an accuracy substantially greater than chance, and
suggest possible avenues of exploration to further in-
crease the segmentation quality.
Introduction
Human beings have an apparently innate ability to seg-
ment continuous spoken speech into words, and that abil-
ity is present in infants as young as 8 months old (Saffran,
Aslin, & Newport 1996). Presumably, the language learn-
ing process begins with learning which utterances are to-
kens of the language and which are not. Several experi-
ments have shown that this segmentation is performed in an
unsupervised manner, without requiring any external cues
about where the breaks should go (Saffran, Aslin, & New-
port 1996)(Saffran et al. 1999). Furthermore, Saffran and
others have suggested that humans use statistical properties
of the audio stream to induce segmentation. This paper pro-
poses a method for the unsupervised segmentation of spoken
speech, based on an algorithm designed to segment discrete
time series into meaningful episodes. The ability to learn
to segment audio speech is useful in and of itself, but also
opens up doorways for the exploration of more natural and
human-like language learning.
Paul Cohen has suggested an unsupervised algorithm
called Voting Experts (VE) that uses the information theo-
retical properties of internal and boundary entropy to seg-
ment discrete time series into categorical episodes (Cohen,
Adams, & Heeringa 2007). VE has previously demonstrated,
among other things, the ability to accurately segment plain
text that has had the spaces and punctuation removed. In this
paper we extend VE to work on audio data. The extension is
not a trivial or straightforward one, since VE was designed
to work on sequences of discrete tokens and audio speech is
continuous and real valued.
Additionally, it is difficult to evaluate an algorithm that
tries to find logical breaks in audio streams. In continu-
ous speech, the exact boundary between phonemes or be-
tween words is often indeterminate. Furthermore, there are
no available audio datasets with all logical breaks labeled,
and given an audio stream it is unclear where all the logical
breaks even are. What counts as a logical break? It is dif-
ficult to quantify the level of granularity with which human
beings break an audio stream, and more so to specify the
limits of any rational segmentation. This paper describes a
method to address these problems.
Our results show that we are able to segment audio
sequences with accuracy significantly better than chance.
However, given the limitations already described, we are
still not at a point to speak to the objective quality of the
segmentation.
Related Work
Our work is directly inspired by the psychological studies
of audio segmentation in human beings (Saffran, Aslin, &
Newport 1996)(Saffran et al. 1999)(Saffran et al. 1997).
These studies show us that the unsupervised segmentation
of natural language is possible, and does not require pro-
hibitively long exposure to the audio. However, these studies
do little to direct us towards a functioning algorithm capable
of such a feat.
Conversely, there are several related algorithms capa-
ble of segmenting categorical time series into episodes
(Magerman & Marcus 1990)(Kempe 1999)(Hafer & Weiss
1974)(de Marcken 1995)(Creutz 2003)(Brent 1999)(Ando
& Lee 2000). But these are typically supervised algorithms,
or not specifically suited for segmentation. In fact, many
of them have more to do with finding minimum descrip-
tion lengths of sequences than with finding logical segmen-
tations.
Gold and Scassellati have created an algorithm specifi-
cally to segment audio speech using the MDL model (Gold
& Scassellati 2006). They recorded 30 utterances by a
single speaker and used MDL techniques to compress the
representation of these utterances. They then labeled each
compressed utterance as positive or negative, depending on
whether the original utterance contained a target word, and
then trained several classifiers on the labeled data. They
used these classifiers to classify utterances based on whether
they contained the target word. This technique achieved
moderate success, but the dataset was small, and it does not
produce word boundaries, which is the goal of this work.
This work makes use of the Voting Experts (VE) algo-
rithm. VE was designed to do with discrete token sequences
exactly what we are trying to do with real audio. That is,
given a large time series, specify all of the logical breaks so
as to segment the series into categorical episodes. The ma-
jor contribution of this paper lies in transforming an audio
signal so that the VE model can be applied to it.
Overview of Voting Experts
The VE algorithm is based on the hypothesis that natural
breaks in a sequence are usually accompanied by two in-
formation theoretic signatures (Cohen, Adams, & Heeringa
2007)(Shannon 1951). These are low internal entropy of
chunks, and high boundary entropy between chunks. A
chunk can be thought of as a sequence of related tokens. For
instance, if we are segmenting text, then the letters can be
grouped into chunks that represent the words.
Internal entropy can be understood as the surprise associ-
ated with seeing the group of objects together. More specif-
ically, it is the negative log of the probability of those ob-
jects being found together. Given a short sequence of tokens
taken from a longer time series, the internal entropy of the
short sequence is the negative log of the probability of find-
ing that sequence in the longer time series. So the higher the
probability of a chunk, the lower its internal entropy.
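The definition above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy string are assumptions made for the example, and the probability is estimated as a simple relative frequency among subsequences of the same length, which may differ from the normalization used in the VE implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous subsequence of length 1..n in `tokens`."""
    counts = Counter()
    for length in range(1, n + 1):
        for i in range(len(tokens) - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    return counts


def internal_entropy(chunk, tokens, counts):
    """Negative log of the relative frequency of `chunk` among all
    subsequences of the same length (assumed estimator)."""
    chunk = tuple(chunk)
    total = len(tokens) - len(chunk) + 1
    if counts[chunk] == 0:
        return math.inf  # an unseen chunk is maximally surprising
    return -math.log(counts[chunk] / total)


# Toy example: text with the spaces removed.
text = "thecatthedogthecat"
counts = ngram_counts(text, 3)
frequent = internal_entropy("the", text, counts)  # occurs three times
rare = internal_entropy("att", text, counts)      # occurs once
```

In this toy string the frequent chunk "the" receives a lower internal entropy than the rare chunk "att", which straddles a word boundary.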
Boundary entropy is the uncertainty at the boundary of
a chunk. Given a sequence of tokens, the boundary entropy
is the expected information gain of being told the next
token in the time series. This is calculated as
$H_I(c) = -\sum_{h=1}^{m} P(h, c) \log P(h, c)$,
where c is this given sequence of tokens, P(h, c) is the
conditional probability of symbol h following c, and m is
the number of tokens in the alphabet.
Well-formed chunks are groups of tokens that are found together
in many different circumstances, so they are somewhat
unrelated to the surrounding elements. This means that,
given such a chunk, no particular token is very likely to
follow it.
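As a rough illustration of boundary entropy, the sketch below estimates the conditional distribution of the next token from successor counts. The helper names and the toy string are assumptions made for this example; they are not taken from the VE implementation.

```python
import math
from collections import Counter, defaultdict


def successor_counts(tokens, length):
    """For each subsequence of the given length, count the tokens that follow it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - length):
        context = tuple(tokens[i:i + length])
        followers[context][tokens[i + length]] += 1
    return followers


def boundary_entropy(context, followers):
    """Entropy of the estimated distribution over the token following `context`."""
    counts = followers.get(tuple(context))
    if not counts:
        return 0.0  # context never observed with a successor
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


text = "thecatthedogthecat"
inside_word = boundary_entropy("th", successor_counts(text, 2))
end_of_word = boundary_entropy("the", successor_counts(text, 3))
```

Here "th" is always followed by "e", so its boundary entropy is zero, while "the" is followed by either "c" or "d" and therefore has a higher boundary entropy.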
In order to segment a discrete time series, VE preprocesses
the time series to build an n-gram trie, which represents
all its possible subsequences of length less than or equal to n.
It then passes a sliding window of length n over the series.
At each window location, two “experts” vote on how they
would break the contents of the window. One expert votes
to minimize the internal entropy of the induced chunks, and
the other votes to maximize the entropy at the break. The
experts use the trie to make these calculations. After all the
votes have been cast, the sequence is broken at the “peaks” -
locations that received more votes than their neighbors. This
algorithm can be run in linear time with respect to the length
of the sequence, and can be used to segment very long se-
quences. For further details, see the journal article (Cohen,
Adams, & Heeringa 2007).
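The voting scheme described above can be sketched as follows. This is a deliberately simplified illustration and not the published implementation: it uses plain dictionaries in place of a trie, raw (unstandardized) entropy scores, one vote per expert per window, and no vote threshold. All function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_stats(seq, n):
    """Count every subsequence of length <= n and the tokens that follow it."""
    counts = Counter()
    followers = defaultdict(Counter)
    for i in range(len(seq)):
        for length in range(1, min(n, len(seq) - i) + 1):
            gram = seq[i:i + length]
            counts[gram] += 1
            if i + length < len(seq):
                followers[gram][seq[i + length]] += 1
    return counts, followers


def internal_entropy(gram, counts, total):
    # Negative log of the (approximate) relative frequency of the subsequence.
    return -math.log(counts[gram] / total)


def boundary_entropy(gram, followers):
    # Entropy of the distribution over tokens that follow the subsequence.
    nxt = followers.get(gram)
    if not nxt:
        return 0.0
    total = sum(nxt.values())
    return -sum((c / total) * math.log(c / total) for c in nxt.values())


def voting_experts(seq, n=3):
    """Return (breaks, votes); a break at k means a boundary before seq[k]."""
    if n < 2:
        raise ValueError("window length n must be at least 2")
    seq = tuple(seq)
    counts, followers = build_stats(seq, n)
    votes = [0] * (len(seq) + 1)
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        splits = range(1, n)
        # Expert 1: split that minimizes the internal entropy of both pieces.
        by_frequency = min(
            splits,
            key=lambda k: internal_entropy(window[:k], counts, len(seq))
            + internal_entropy(window[k:], counts, len(seq)),
        )
        # Expert 2: split with the highest boundary entropy after the prefix.
        by_entropy = max(
            splits, key=lambda k: boundary_entropy(window[:k], followers)
        )
        votes[start + by_frequency] += 1
        votes[start + by_entropy] += 1
    # Break at "peaks": positions with more votes than both neighbors.
    breaks = [
        k for k in range(1, len(seq))
        if votes[k] > votes[k - 1] and votes[k] > votes[k + 1]
    ]
    return breaks, votes


breaks, votes = voting_experts("thecatthedogthecat", n=3)
```

Each window contributes exactly two votes, so the vote total grows linearly with the sequence length. On a string this short the induced breaks are not expected to be meaningful; the statistics only become reliable on long sequences.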
It is important to emphasize the VE model over the actual
implementation of VE. The goal of our work is to segment
audio speech based on these information theoretic markers,
and to evaluate how well they work for this task. In order
to do this, we use a particular implementation of Voting Ex-
perts, and transform the audio data into a format it can use.
This is not necessarily the best way to apply this model to
audio segmentation. But it is one way to use this model to
segment audio speech.
The model of segmenting based on low internal en-
tropy and high boundary entropy is also closely related to
the work in psychology mentioned above (Saffran et al.
1999). Specifically, they suggest that humans segment au-
dio streams based on conditional probability. That is, given
two phonemes A and B, we conclude that AB is part of a
word if the conditional probability of B occurring after A is
high. Similarly, we conclude that AB is not part of a word if
the conditional probability of B given A is low. The informa-
tion theoretic markers of VE are simply a more sophisticated
characterization of exactly this idea. Internal entropy is di-
rectly related to the conditional probability inside of words.
And boundary entropy is directly related to the conditional
probability between words. So we would like to be able to
use VE to segment audio speech, both to test this hypothesis
and to possibly facilitate natural language learning.
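The conditional (transitional) probability idea can be illustrated with a small sketch. The three nonsense "words" and their ordering below are hypothetical and chosen only for this example; they are not the stimuli from the cited studies.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Estimate P(B | A) for every adjacent pair (A, B) in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # each syllable that has a successor
    return {
        (a, b): count / first_counts[a]
        for (a, b), count in pair_counts.items()
    }


# Hypothetical three-syllable nonsense words concatenated without pauses.
words = [["bi", "da", "ku"], ["pa", "do", "ti"], ["go", "la", "bu"]]
order = [0, 1, 2, 0, 2, 1, 0]
stream = [syllable for index in order for syllable in words[index]]

tp = transitional_probabilities(stream)
within_word = tp[("bi", "da")]   # "da" always follows "bi"
across_words = tp[("ku", "pa")]  # "ku" is followed by "pa" or "go"
```

Within a word the transitional probability is 1.0, while across the word boundary after "ku" it drops to 0.5, which is the kind of dip that would mark a candidate break.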
Experimental Procedure
Our procedure can be broken down into three steps. 1) Tem-
porally discretize the audio sequence while retaining the rel-
evant information. 2) Tokenize the discrete sequence. 3)
Apply VE to the tokenized sequence to obtain the logical
breaks. These three steps are described in detail below, and
illustrated in Figure 2.
Figure 1: A voiceprint of the first few seconds of one of our
audio datasets. The vertical axis represents 33 frequency
bins and the horizontal axis represents time. The intensity
of each frequency is represented by the color. Each vertical
line of pixels then represents a spectrogram calculated over
a short Hamming window at a specific point in time.
Step 1
In order to discretize the sequence, we used the discrete
Fourier transform in the Sphinx software package to obtain
the spectrogram information (Walker et al. 2004). We
also took advantage of the raised cosine windower and the
pre-emphasizer in Sphinx. The audio stream was windowed
into 26.6 ms wide segments called Hamming windows, taken
every 10 ms (i.e., the windows were overlapping). The
windower also applied a transformation on the window to
emphasize the central samples and de-emphasize those on
the edge. Then the pre-emphasizer normalized the volume
across the frequency spectrum. This compensates for the
natural attenuation (decrease in intensity) of sound as the
frequency is increased.
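A minimal NumPy sketch of this kind of front end is given below. It does not use Sphinx and is not our actual pipeline: the first-order pre-emphasis filter, its coefficient of 0.97, and the choice to apply it before windowing are assumptions based on common practice, and the reduction to a small number of frequency bins is not shown.

```python
import numpy as np


def spectrogram(signal, sample_rate, window_ms=26.6, hop_ms=10.0, preemphasis=0.97):
    """Magnitude spectrogram from overlapping Hamming windows.

    Returns an array of shape (n_frames, win_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    # Assumed first-order pre-emphasis: boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - preemphasis * signal[:-1])

    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if len(emphasized) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    # Raised-cosine (Hamming) window emphasizes the central samples.
    window = np.hamming(win_len)
    n_frames = 1 + (len(emphasized) - win_len) // hop
    frames = np.stack([
        emphasized[i * hop:i * hop + win_len] * window
        for i in range(n_frames)
    ])
    # Discrete Fourier transform of each frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a 1 kHz tone sampled at 16 kHz.
rate = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(rate) / rate)
spec = spectrogram(tone, rate)
```

Each row of the result corresponds to one 26.6 ms window taken every 10 ms, so consecutive rows overlap in time.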
Finally, we used the discrete Fourier transform to obtain
the spectrogram. This is a very standard procedure to obtain


