Temporal decomposition for the initialization of a HMM isolated word-recognizer

by M Taylor, F Bimbot
IEEE International Conference on Acoustics Speech and Signal Processing (1992)

Abstract

The technique of temporal decomposition is used to initialize continuous density hidden Markov models. The temporal decomposition process produces a representation of each word in terms of a set of target vectors and interpolation functions. Roughly speaking, the target vectors represent the centers of the important acoustic events, and the interpolation functions describe a spectral path between these events. The number of targets generated by the temporal decomposition process is taken to be the number of states used for the HMM, and the position, shape and length of the interpolation functions are used to provide initial estimates for the transition probabilities and observation probability densities of the HMM. The performance of such a system is assessed for a single-speaker environment.

Available from ieeexplore.ieee.org

TEMPORAL DECOMPOSITION FOR THE INITIALIZATION
OF A HMM ISOLATED WORD-RECOGNIZER
M. TAYLOR, F. BIMBOT.
Télécom Paris - Dept Signal, C.N.R.S. - URA 820, 46 rue Barrault, 75634 PARIS Cedex 13, FRANCE.
ABSTRACT
In this paper, the technique of temporal decomposition is used
to initialize continuous density Hidden Markov Models. The
temporal decomposition process produces a representation of
each word in terms of a set of target vectors and interpolation
functions. Roughly speaking, the target vectors represent the
centres of the important acoustic events, and the interpolation
functions describe a spectral path between these events [1]. In
our approach, the number of targets generated by the temporal
decomposition process is taken to be the number of states
used for the HMM, and the position, shape and length of the
interpolation functions are used to provide initial estimates
for the transition probabilities and observation probability
densities of the HMM. The performance of such a system is
assessed for a single-speaker environment.
1 INTRODUCTION
Many successful speech recognition systems are based on
Hidden Markov Model methods. These models are defined by a
set of parameters which are found using an iterative re-
estimation algorithm on a set of training repetitions of each
word in the vocabulary.
The problem of initialization of this training phase for the
continuous density HMM is examined. Various methods
involving clustering techniques in use today [2] generate
estimates of the number of coherent spectral zones, their
duration and their spectral characteristics. The one that is
proposed in this study is based on temporal decomposition [3]
[4]. Under certain circumstances, it is shown that this
technique can be used to find appropriate initial estimates of
the HMM parameters that can subsequently be used as the
input for the classical training process.
2 TEMPORAL DECOMPOSITION
Temporal decomposition (TD) is a model of speech spectral
evolution where a sequence of spectral parameters (LPC
coefficients, for instance) is described as a linear combination
of a limited set of vectors, called targets. The time-
contribution of each target is given by non-uniformly spaced
interpolation functions that are constrained to a limited time
extent (compact functions). Interpolation functions overlap
with each other, which is a way to model transitions between
successive sounds.
The evolution of N spectral parameters (m-dimensional
vectors) $y(t)$ is modeled as the linear combination of a
number n of m-dimensional target vectors $g_k$, weighted by
scalar compact interpolation functions $\phi_k(t)$:
$$y(t) \approx \hat{y}(t) = \sum_{k=1}^{n} \phi_k(t)\, g_k$$
where $\hat{y}(t)$ denotes an estimate of $y(t)$, and $1 \le t \le N$.
The unknown parameters of the model are n, $\{\phi_k\}$ and $\{g_k\}$.
The $\phi$-functions and n are estimated first by the so-called Adaptive
Windowing process [4]. Targets $g_k$ and $\phi$-functions are then
alternately re-estimated by "Iterative Refinement" [3].
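The synthesis side of the model (rebuilding each frame from targets and interpolation functions) can be illustrated with a small sketch. It is not the authors' implementation: the triangular shape of the interpolation functions, the helper names `triangular_phi` and `reconstruct`, and the toy targets are all assumptions made for illustration, and frames are indexed from 0 rather than from 1 as in the text. The estimation steps (Adaptive Windowing and Iterative Refinement) are not shown.

```python
def triangular_phi(center, half_width, num_frames):
    """Hypothetical compact interpolation function: a triangle that peaks
    at 1.0 on `center` and is zero outside center +/- half_width."""
    return [max(0.0, 1.0 - abs(t - center) / half_width)
            for t in range(num_frames)]


def reconstruct(phis, targets):
    """Return y_hat[t] = sum over k of phis[k][t] * targets[k]."""
    num_frames = len(phis[0])
    dim = len(targets[0])
    y_hat = []
    for t in range(num_frames):
        frame = [0.0] * dim
        for phi, g in zip(phis, targets):
            weight = phi[t]
            for i in range(dim):
                frame[i] += weight * g[i]
        y_hat.append(frame)
    return y_hat


# Toy example (assumed values): N = 9 frames, n = 3 targets, m = 2 dimensions.
N = 9
phis = [triangular_phi(c, 4.0, N) for c in (0, 4, 8)]
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
y_hat = reconstruct(phis, targets)

print(y_hat[0])  # frame on the first target
print(y_hat[2])  # frame halfway between the first and second targets
```

At a frame where a single interpolation function equals 1 the reconstruction coincides with that target; where two functions overlap, the frame is a weighted mix of the neighbouring targets, which is how the model represents transitions.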
Figure 1 below shows an example of the temporal
decomposition of the Italian word "chiama" ([kjama]).
Figure 1: An example of temporal decomposition
for the Italian word "chiama" ([kjama]).
Signal, spectrogram and $\phi$-functions.
Note that a threshold in the temporal decomposition
algorithm has a direct influence on the final number of
interpolation functions (n). The exact number of
acoustic events per word cannot be chosen a priori, but
the value of n tends to decrease as the threshold increases.
3 HIDDEN MARKOV MODELS
A Hidden Markov Model (HMM) can be used to represent a
given speech segment in a stochastic manner. It models each
of the various quasi-stationary spectral zones characteristic of
a particular word as a state in a Markov chain with an
associated observation probability density function [5]. If
there are fewer states in the HMM than there are spectral zones
in the speech portion, the states have to model two or more
spectrally incoherent regions, which leads to poor
recognition performance. If the number of states is too large,
the training process becomes very time-consuming.
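As a rough illustration of how a temporal decomposition could seed such a model, the sketch below builds a left-to-right HMM with one state per target. The exact mapping used by the authors is not given in this excerpt, so every rule here is an assumption: each frame is assigned to the state whose interpolation function is largest, the target serves as the state mean, a diagonal variance is measured around the target, and the self-loop probability is set to 1 - 1/d for a state covering d frames (the value that gives a geometric duration with mean d). The function name `init_hmm_from_td`, the variance floor, and the toy data are hypothetical.

```python
def init_hmm_from_td(phis, targets, frames, var_floor=1e-3):
    """Assumed mapping from a temporal decomposition to initial HMM values.

    phis:    n lists of N interpolation weights (one list per target)
    targets: n target vectors of dimension m
    frames:  N observed spectral vectors of dimension m
    Returns (means, diagonal variances, n x n transition matrix).
    """
    n, num_frames, dim = len(phis), len(frames), len(targets[0])
    # Assign each frame to the state whose interpolation function dominates.
    owner = [max(range(n), key=lambda k: phis[k][t]) for t in range(num_frames)]
    means, variances = [], []
    trans = [[0.0] * n for _ in range(n)]
    for k in range(n):
        members = [frames[t] for t in range(num_frames) if owner[t] == k]
        duration = max(len(members), 1)
        means.append(list(targets[k]))  # target used as the state mean
        variances.append([
            max(sum((x[i] - targets[k][i]) ** 2 for x in members) / duration,
                var_floor)
            for i in range(dim)
        ])
        stay = 1.0 - 1.0 / duration  # mean geometric duration = `duration`
        if k < n - 1:
            trans[k][k] = stay
            trans[k][k + 1] = 1.0 - stay
        else:
            trans[k][k] = 1.0  # assumption: the last state is absorbing
    return means, variances, trans


# Toy data (assumed): n = 2 targets, N = 6 frames, m = 1 dimension.
phis = [[1.0, 1.0, 0.8, 0.3, 0.0, 0.0],
        [0.0, 0.0, 0.2, 0.7, 1.0, 1.0]]
targets = [[0.0], [2.0]]
frames = [[0.1], [-0.1], [0.4], [1.6], [2.1], [1.9]]
means, variances, trans = init_hmm_from_td(phis, targets, frames)
print(trans)
```

Values obtained this way would only be starting points; the classical iterative re-estimation would still be run on the training repetitions afterwards.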
