Sign up & Download
Sign in

Bayesian online algorithms for learning in discrete Hidden Markov Models

by Roberto C Alamino, Nestor Caticha
Discrete and Continuous Dynamical Systems Series B (2007)

Abstract

We propose and analyze two different Bayesian online algorithms for learning in discrete Hidden Markov Models and compare their performance with the already known Baldi-Chauvin Algorithm. Using the Kullback-Leibler divergence as a measure of generalization we draw learning curves in simplified situations for these algorithms and compare their performances.

Author-supplied keywords

Cite this document (BETA)

Available from Roberto Alamino's profile on Mendeley.
Page 1
hidden

Bayesian online algorithms for learning in discrete Hidden Markov Models

Manuscript submitted to Website: http://AIMsciences.org
AIMS’ Journals
Volume X, Number 0X, XX 200X pp. X–XX
BAYESIAN ONLINE ALGORITHMS FOR LEARNING IN
DISCRETE HIDDEN MARKOV MODELS
Roberto C. Alamino
Neural Computing Research Group, Aston University
Main Building, Birmingham, B7 4ET, United Kingdom
Nestor Caticha
Instituto de F´ısica, Universidade de Sa˜o Paulo
CP 66318, Sa˜o Paulo, SP, CEP 05389-970, Brazil
(Communicated by the associate editor name)
Abstract. We propose and analyze two different Bayesian online algorithms
for learning in discrete Hidden Markov Models and compare their performance
with the already known Baldi-Chauvin Algorithm. Using the Kullback-Leibler
divergence as a measure of generalization we draw learning curves in simplified
situations for these algorithms and compare their performances.
1. Introduction. The unifying perspective of the Bayesian approach to machine
learning allows the construction of efficient algorithms and sheds light on the char-
acteristics they should have in order to attain such efficiency. In this paper we
construct and characterize the performance of mean field online algorithms for dis-
crete Hidden Markov Models (HMM) [5, 9] derived from approximations to a fully
Bayesian algorithm.
HMMs form a class of graphical models used to model the behavior of time series.
They have a wide range of applications which includes speech recognition [9], DNA
and protein analysis [3, 4] and econometrics [10].
Discrete HMMs are defined by an underlying Markov chain with hidden states qt,
t = 1, ..., T , in the set S = {s1, s2, ..., sn}, initial probability vector pii = P(q1 = si)
and transition matrix Aij(t) = P(qt+1 = sj|qt = si), with i, j = 1, .., n. At each
time step, the state qt emits a state yt in the set O = {o1, ..., om} with a probability
given by Biα(t) = P(yt = oα|qt = si), with i = 1, ..., n and α = 1, ...,m. When A
and B do not depend on time the HMM is said homogeneous, this is a simplifying
assumption, which is not needed for online learning.
The observed states yt of the HMM represent the observations of the time series,
i.e., a time series from time t = 1 to t = T is represented by the observed sequence
yT1 = {y1, y2, ..., yT }. The unavailable or hidden states qt form the so called hidden
sequence qT1 = {q1, q2, ..., qT }.
2000 Mathematics Subject Classification. Primary: 68T05; Secondary: 60J20, 62F15.
Key words and phrases. HMM, online algorithm, generalization error, Bayesian algorithm.
This work was supported mainly by FAPESP and partially by EVERGROW, IP no. 1935 in
FP6 of the EU.
1
Page 2
hidden
2 ROBERTO C. ALAMINO AND NESTOR CATICHA
The parameters that define the HMM are pi, A and B which we denote compactly
as ω = (pi,A,B). The probability of observing a particular sequence yT1 given
parameters ω is
P(yT1 |ω) =

qT1
P(yT1 , qT1 |ω)
=

qT1
P(y1)P(y1|q1)
T

t=2
P(qt+1|qt)P(yt|qt).
(1)
Most applications consist in adapting the parameters of the HMM in order to
produce sequences which mimic the behavior of some given time series. This is the
learning process on the HMM. Depending on how the data is presented it can range
from offline to online. In the former, the whole dataset, usually given by just one
observed sequence yT1 , is presented to the algorithm and the optimal parameters
are calculated all at once. In the later, the dataset, usually composed of a certain
number P of independently generated observed sequences, is presented only a part
at a time after which, a partial calculation of the parameters is made. We will
concentrate our analysis in the last case.
Several merit figures can be used to measure the performance of a learning al-
gorithm. For the sake of simplicity we will study a scenario where the data set
has been generated by a HMM of unknown parameters. This is an extension of
the student-teacher scenario where the statistical mechanical properties of neural
networks have been extensively studied. The performance of the learning process,
as a function of the number of observations P , is given by how far, as measured
by some suitable defined criterion, lies the student HMM from the teacher HMM.
In this paper we only concentrate in the naturally arising Kullback-Leibler diver-
gence (KL-divergence), which although not immediately usable in practice, since it
needs knowledge from the unavailable teacher, is a simple extension of the idea of
generalization error and therefore can be very informative.
We propose two algorithms for learning on HMMs and compare their perfor-
mances with the online Baldi-Chauvin Algorithm (BC) [2] where a softmax repre-
sentation of the HMM parameters is used to calculate a maximum likelihood esti-
mate. Starting from a Bayesian formulation of the learning problem we introduce
a Bayesian Online Algorithm (BOnA) for HMMs. This can be simplified, without
noticeable deterioration of the performance, leading to a Mean Posterior Algorithm
(MPA). Using the KL-divergence, we draw learning curves for these algorithms in
simplified situations and compare with the BC algorithm.
These methods, inspired by the works of Opper [7] and Amari [1] are essentially
mean field methods [8]. The Bayesian prescription for the posterior usually leads
to a distribution that cannot be handled in practice. The idea is to introduce a
manifold of tractable distributions which will be used as priors. The information
of the new datum leads, through Bayes theorem, to a non-tractable posterior. The
basic inference step is to consider as the new prior to be used with the next datum,
not the posterior itself, but one of the tractable distributions. Which one? The
closest, in some sense. The choice of distance defines the algorithm.
The rest of the paper is organized as follows: in section 2 we derive the Bayesian
online algorithm for HMMs and propose the MPA approximation for it. We present
conclusions and perspectives for future works in section 3.
Page 3
hidden
BAYESIAN ONLINE LEARNING IN DHMMS 3
2. The Bayesian Online Algorithm. The Bayesian Online Algorithm (BOnA),
which can be found in [7], is a method for estimating parameters using the power
of Bayesian inference.
Our task here is to adjust the HMM parameters ω using the information con-
tained in a dataset given by
DP = {y1, ..., yP }. (2)
To accomplish it in a Bayesian way, we need to estimate the parameters us-
ing a probability distribution P(ω|DP ) which encodes the information about the
probability of the parameters after the observation of DP . Assuming that the ob-
servations in DP are independent, at each new sequence we can update our previous
distribution in a new posterior using Bayes’ theorem.
Figure 1. In gray we show the manifold of the tractable distri-
butions. The two-step algorithm is represented by the arrows: a)
1st. step: a new observation modifies the former distribution tak-
ing it away from the parametric family manifold. b) 2nd. step: the
modified distribution is projected back into the original manifold
by minimization of the KL-divergence.
If we choose a prior distribution belonging to a parametric family with parameters
λ, after one update the posterior will no longer be in this family. The method
consists in projecting this posterior back into the parametric manifold by minimizing
the KL-divergence
DKL(P(ω|λP , yP+1)||P(ω|λP+1)) =

P(ω|λP , yP+1) ln P(ω|λ
P , yP+1)
P(ω|λP+1) dω, (3)
what is represented schematically in figure 1.
At each point in the learning process, the estimative for ω that we will choose
will be its mean in the final distribution.
The choice of the parametric family of distributions will be dictated by the par-
ticular problem we are dealing with. When the parametric family is of the form
P (x) = 1Z e

P
i λifi(x), (4)
which is the case when the distribution is obtained by the maximum entropy method
using as constraints the average value of n arbitrary functions fi(x)
〈fi(x)〉P ≡

dxfi(x)P (x), i = 1, ..., n, (5)
Page 4
hidden
4 ROBERTO C. ALAMINO AND NESTOR CATICHA
minimizing the KL-divergence turns out to be equivalent to equate the average of
the n constraint functions
〈fj〉P = 〈fj〉Q , (6)
where the indices indicate in which distribution we need to take the averages and
Q(x) is the still unprojected posterior.
2.1. Bayesian Online Algorithm for HMMs. In order to apply the BOnA to
the case of HMMs, we need to choose an appropriate parametric family. Making
the simplifying assumption that matrices pi, A and B are independent, we can write
a factorized distribution
P(ω|u) ≡ P(pi|ρ)P(A|a)P(B|b), (7)
where u = (ρ, a, b) are parameters of the distributions. The vector pi and each row
of the matrices A and B are discrete probability distributions for the initial states,
the transition of a given hidden state and the emission of an observed state by a
given hidden one respectively. As they are different distributions, we will assume
that they can be treated independently and factorize the distribution (7) once again
using
P(A|a) =
n

i=1
P(Ai|ai), P(B|b) =
n

i=1
P(Bi|bi), (8)
where
Ai ≡ (Ai1, ..., Ain), Bi ≡ (Bi1, ..., Bim), (9)
which gives for (7)
P(ω|u) ≡ P(pi|ρ)
n

i=1
P(Ai|ai)P(Bi|bi). (10)
where each factor is a density of probability that a probability has a given value. A
natural choice is the family of Dirichlet distributions
D(x|u) = Γ(u0)
∏N
i=1 Γ(ui)
N

i=1
xui−1i , (11)
where x = (x1, ..., xN ) is a N -dimensional real vector and u = (u1, ..., uN) subject
to the constraints
0 ≤ xi ≤ 1,
N

i=1
xi = 1, ui > 0, (12)
and u0 =

i ui.
As (10) is a factorized distribution, the minimization of the KL-divergence can
be made separately for each factor distribution. The Dirichlet distribution can be
obtained from the maximum entropy method using as constraints the values of the
averages of the logarithms of the variables [11]
0 ≤ xi ≤ 1,

i
xi = 1, (13)
and

dµD(x) = 1,

dµD(x) ln xi = αi, (14)
Page 5
hidden
BAYESIAN ONLINE LEARNING IN DHMMS 5
where dµ is defined as
dµ ≡ δ
(

i
xi − 1
)

i
θ(xi)dxi. (15)
Extremizing the entropy subject to the constraints is equivalent to extremizing
L =

dµD lnD + λ
(∫
dµD − 1
)
+

i
λi
(∫
dµD lnxi − αi
)
, (16)
and applying δL/δD = 0 we have
δ
(

i
xi − 1
)[

i
θ(xi)
](
1 + lnD + λ+

i
λi lnxi
)
= 0, (17)
with solution
D(x) = exp(−λ− 1)

i
x−λii , (18)
which is the Dirichlet distribution for Z = exp(λ+ 1) and ui = 1− λi.
So, the minimization of the KL-divergence corresponds to equating the average
of the logarithms in the original and projected distributions. The final result is
ψ(ρi)− ψ



j
ρj

 = 〈lnpii〉Q ≡ µi(ρ), (19)
ψ(aij)− ψ
(

k
aik
)
= 〈lnAij〉Q ≡ µij(a),
ψ(biα)− ψ



β
biβ

 = 〈lnBiα〉Q ≡ µiα(b),
where Q is the posterior distribution and ψ is the digamma function, the logarithm
derivative of the gamma function
ψ(x) = ddx ln Γ(x) =
Γ′(x)
Γ(x) . (20)
We will call a set of N equations of the form
ψ(xi)− ψ



j
xj

 = µi, (21)
with i = 1, ...N a digamma system in the variables xi with coefficients µi.
To solve the problem we need two more steps: calculate the coefficients µ and
solve the digamma system (19).
Let us call P p(ω) the projected distribution after observation of the sequence yp,
and Qp+1(ω) the posterior distribution (not projected yet) after the observation of
the sequence yp+1. Then, by Bayes’ theorem,
Qp+1(ω) = 1ZQ
P p(ω)

qp+1
P(yp+1, qp+1|ω), (22)
ZQ =

qp+1

dω P p(ω)P(yp+1, qp+1|ω). (23)
Page 6
hidden
6 ROBERTO C. ALAMINO AND NESTOR CATICHA
Due to the summation over hidden states, the posterior distribution is a mixture
of Dirichlets. If we had access to these states, the posterior would be a simple
Dirichlet distribution and we would not need the projection step. The projected
distribution P p is, by construction, the product distribution
P p(ω) = P(pi|ρp)P(A|ap)P(B|bp), (24)
where ρp, ap and bp are the values of the parameters u after observation of the
sequence yp. From here on, let us suppress the indices relative to the sequences by
adopting the convention that primed variables correspond to the upper index p+1
and unprimed to p. Then, equation (22) can be written as
Q′(ω) = 1ZQ′
P(pi|ρ)P(A|a)P(B|b)

q′
P(y′, q′|pi,A,B), (25)
ZQ′ =

q′

dpi dAdB P(pi|ρ)P(A|a)P(B|b)P(y′, q′|pi,A,B). (26)
Using the notation
q′t = si ⇒ pi(q′t) ≡ pii, (27)
q′t = si, q′t+1 = sj ⇒ A(q′t, q′t+1) ≡ Aij ,
q′t = si, y′t = oα ⇒ B(q′t, y′t) ≡ Biα,
the normalization of the posterior can be factored as
ZQ′ =

q′

dpiP(pi|ρ)pi(q′1) ·

dAP(A|a)
T ′−1

t=1
A(q′t, q′t+1)
×

dB P(B|b)
T ′

t=1
B(q′t, y′t),
(28)
where T ′ is the length of the sequence y′ ≡ yp+1. The integrals on A and B can be
factored once more in the rows of these matrices:
ZQ′ =

q′

dpiP(pi|ρ)pi(q′1) ·

i

dAi P(Ai|ai)

q′t=si
A(q′t, q′t+1)
×

i

dBi P(Bi|bi)

q′t=si
B(q′t, y′t).
(29)
The calculation of the µ coefficients in (19) now involves expectations over Dirich-
let distributions of the form
µi =




j
xrjj

 lnxi

. (30)
Taking the average over the Dirichlet distribution D(x|u), we have
µi =


j
xrjj

[ψ(ui + ri)− ψ(u0 + r0)], (31)
with


j
xrjj

= Γ(u0)∏
j Γ(uj)

j Γ(uj + rj)
Γ(u0 + r0)
, (32)
Page 7
hidden
BAYESIAN ONLINE LEARNING IN DHMMS 7
solving the problem of calculating the coefficients of the digamma systems.
2.1.1. Digamma Systems. The next step is to solve the digamma systems them-
selves. In order to find a solution for the general form
ψ(xi)− ψ



j
xj

 = µi, i = 1, ..., d, (33)
we devised the simple method of transforming these equations in a one-dimensional
map and find numerically the fixed point by iterating from an arbitrary initial point.
What we need to do is to solve for xi
xi = ψ−1

µi + ψ



j
xj



 . (34)
Summation over i using the definition x0 ≡

i xi gives
x0 =

i
ψ−1[µi + ψ(x0)]. (35)
We solve this equation by iteration of the map
x(n+1)0 =

i
ψ−1[µi + ψ(x(n)0 )]. (36)
This completes the algorithm. BOnA suffers from a common problem of Bayesian
algorithms: we have to sum over the hidden variables making the complexity scales
as nT , i.e., exponential in the length of the observed sequences. In the BOnA case
we have a worse situation due to the great quantity of digamma functions needed to
be calculated numerically. Everything summed up makes the algorithm very time
consuming. In the next section, we will develop an approximation that runs faster,
although still have exponential complexity in T . This is however not a problem
since T can be kept fixed and small without depending on the total number of
examples, this dependence must be stressed is polynomial.
2.2. Mean Posterior Approximation. The algorithm we present now will be
called the Mean Posterior Approximation (MPA) and it is a simplification of the
BOnA that is itself inspired in the results of its application to Gaussian distribu-
tions.
In the BOnA, we projected the posterior distribution into a parametric distribu-
tion by minimizing the KL-divergence. For Gaussian distributions this is equivalent
to matching the first and second moments of the posterior and projected distribu-
tions. Observing this fact, we decided instead of minimizing the KL-divergence
matching the mean and one of the variances of the posterior with those of the
projected distributions.
Page 8
hidden
8 ROBERTO C. ALAMINO AND NESTOR CATICHA
With a hat over variables indicating the new estimated values, the matching of
the means is given by
pˆii = 〈pii〉Q′ ⇒ ρˆi = 〈pii〉Q′

j
ρˆj , (37)
Aˆij = 〈Aij〉Q′ ⇒ aˆij = 〈Aij〉Q′

k
aˆik,
Bˆiα = 〈Biα〉Q′ ⇒ bˆiα = 〈Biα〉Q′

β
bˆiβ.
We still have some freedom that can be fixed by matching the first variance of
each Dirichlet distribution. For the initial probabilities pi, defining ρ0 ≡

i ρi, this
will be given by

pi21

Q′ − 〈pi1〉
2
Q′ =
ρˆ1(ρˆ0 − ρˆ1)
ρˆ20(ρˆ0 + 1)
. (38)
After some algebra we find that the final update equations for the parameters
are
ρˆi = 〈pii〉Q′
〈pi1〉Q′ −

pi21

Q′
〈pi21〉Q′ − 〈pi1〉
2
Q′
, (39)
aˆij = 〈aij〉Q′
〈ai1〉Q′ −

a2i1

Q′
〈a2i1〉Q′ − 〈ai1〉
2
Q′
,
bˆiα = 〈biα〉Q′
〈bi1〉Q′ −

b2i1

Q′
〈b2i1〉Q′ − 〈bi1〉
2
Q′
.
1 10 100
p
0.1
d
(K
ull
ba
ck
-L
eib
ler
)
Figure 2. Comparison between the learning curves of MPA
(dashed line) and BOnA (circles). Logarithmic scale.
These formulas set the values of uˆ in the projected distribution. Figure 2 shows
that the performance of MPA and BOnA are different in the beginning of the learn-
ing process, but as more sequences are processed, the loss of information becomes
smaller and both algorithms come closer relatively fast. It is interesting to observe
that MPA is always best than BOnA in the simulation, an effect that needs more
studies to be fully explained. For this simulation we used n = 2, m = 3 and T = 2
and made the average over 150 random teachers. The initial student is symmetric.
Page 9
hidden
BAYESIAN ONLINE LEARNING IN DHMMS 9
The real computational time for BOnA was 340 minutes, while for MPA was 5
seconds in a 1GHz processor.
10 100 1000 10000
p
0.1
d
(K
ull
ba
ck
-L
eib
ler
)
0.001
0.01
0.1
Figure 3. Comparison between MPA (dashed line) and Baldi-
Chauvin (continuous lines). The values next to the curves indicate
λ for the corresponding simulation. The learning rate was fixed to
ηBC = 0.5. Logarithmic scale.
Figure 3 compares the performance of MPA to Baldi-Chauvin showing that the
generalization ability of MPA is superior. The parameters of the simulations are
n = 2, m = 3 and T = 2. The curves are the result of the average over 500 random
teachers with symmetric initial students.
3. Conclusions. We proposed and analyzed two Bayesian algorithms for online
learning in discrete HMMs: the full Bayesian Online Algorithm (BOnA) and the
Mean Posterior Approximation (MPA) to the BOnA.
After introducing the BOnA, we developed a simplification that runs faster which
we called MPA. The MPA was then compared with the Baldi-Chauvin algorithm and
shown to perform better with respect to the generalization ability. Although MPA
scales exponentially with the size of the sequence like BOnA, it is not a problem
for we can fix this size to some small value and still have a good generalization.
We have applied these algorithms to learning real data time series with good
preliminary results. We have also studied the effects of drifting rules and of sharp
changes in the series. These results will be published elsewhere.
Acknowledgements. In addition to the funding cited in the beginning of this
paper, we would like to thank also Evaldo Oliveira, Manfred Opper and Lehel
Csato. This work was made mostly in the University of Sa˜o Paulo, Brazil and part
in Aston University, UK. We would also like to thank Prof. Ruedi Stoop for the
kind invitation for contributing with this work.
REFERENCES
[1] S. Amari, Neural learning in structured parameter spaces - natural Riemannian gradient
NIPS’96 9, MIT Press (1996).
[2] P. Baldi and Y. Chauvin Smooth on-line learning algorithms for hidden Markov models Neural
Computation 6 (1994), 307–318.
[3] P. Baldi and S. Brunak “Bioinformatics: The Machine Learning Approach” (Second Edition)
MIT Press, 2001.
Page 10
hidden
10 ROBERTO C. ALAMINO AND NESTOR CATICHA
[4] R. Durbin, S. Eddy, A. Krogh and Mitchison, G. “Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids” Cambridge University Press, Cambridge, 1998.
[5] Y. Ephraim and N. Merhav Hidden Markov processes IEEE Trans. Inf. Theory 48 (2002),
1518–1569.
[6] T. Heskes and W. Wiegerinck On-line learning with time-correlated examples in “On-line
Learning in Neural Networks” (Ed. D. Saad), Cambridge University Press, Cambridge, (1998),
251–278.
[7] M. Opper A Bayesian approach to on-line learning in “On-line Learning in Neural Networks”
(Ed. D. Saad), Publications of the Newton Institute, Cambridge Press, Cambridge, (1998),
363–378.
[8] M. Opper and D. Saad “Advanced Mean Field Methods: Theory and Practice” The MIT
Press, 2001.
[9] L. R. Rabiner A tutorial on hidden Markov models and selected applications in speech recog-
nition Proc. IEEE 77 (1989), 257–286.
[10] T. Ryde´n, T. Terasvirta and S. Asbrink Stylized facts of daily return series and the hidden
Markov model J. Applied Econometrics 13 (1998), 217-244.
[11] M. O. Vlad, M. Tsuchiya, P. Oefner and J. Ross Bayesian analysis of systems with ran-
dom chemical composition: renormalization-group approach to Dirichlet distributions and
the statistical theory of dilution Phys. Rev. E 65 (2001), 011112(1)–01112(8).
Received September 2006; revised February 2007.
E-mail address: alaminrc@aston.ac.uk
E-mail address: nestor@if.usp.br

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

1 Reader on Mendeley
by Discipline
 
100% Physics
by Academic Status
 
100% Researcher (at an Academic Institution)
by Country
 
100% United Kingdom