Online Learning in Discrete Hidden Markov Models
AIP Conference Proceedings (2007)
- DOI: 10.1063/1.2423274
- arXiv: 0708.2377
Abstract
We present and analyse three online algorithms for learning in discrete Hidden Markov Models (HMMs) and compare them with the Baldi-Chauvin Algorithm. Using the Kullback-Leibler divergence as a measure of generalisation error we draw learning curves in simplified situations. The performance for learning drifting concepts of one of the presented algorithms is analysed and compared with the Baldi-Chauvin algorithm in the same situations. A brief discussion about learning and symmetry breaking based on our results is also presented.
arXiv:0708.2377v1 [stat.ML] 17 Aug 2007
Roberto Alamino∗ and Nestor Caticha†
∗Neural Computing Research Group,
Aston University
Aston Triangle, Birmingham, B4 7ET, United Kingdom
†Instituto de Física,
Universidade de São Paulo,
CP 66318 São Paulo, SP, CEP 05389-970 Brazil
Key Words: HMMs, Online Algorithm, Generalisation Error, Bayesian Algorithm.
INTRODUCTION
Hidden Markov Models (HMMs) [1, 2] are extensively studied machine learning models for time series, with applications in fields such as speech recognition [2], bioinformatics [3, 4] and LDPC codes [5]. They consist of a Markov chain of non-observable hidden states $q_t \in S$, $t = 1, \dots, T$, $S = \{s_1, s_2, \dots, s_n\}$, with initial probability vector $\pi_i = P(q_1 = s_i)$ and transition matrix $A_{ij}(t) = P(q_{t+1} = s_j \mid q_t = s_i)$, $i, j = 1, \dots, n$. At each discrete time $t$, the hidden state $q_t$ emits an observed state $y_t \in O$, $O = \{o_1, \dots, o_m\}$, with emission probability matrix $B_{i\alpha}(t) = P(y_t = o_\alpha \mid q_t = s_i)$, $i = 1, \dots, n$, $\alpha = 1, \dots, m$. The $y_t$ are the actual observations of the time series, represented from time $t = 1$ to $t = T$ by the observed sequence $y_1^T = \{y_1, y_2, \dots, y_T\}$. The $q_t$ form the so-called hidden sequence $q_1^T = \{q_1, q_2, \dots, q_T\}$. The probability of observing a sequence $y_1^T$ given $\omega \equiv (\pi, A, B)$ is

$$P(y_1^T \mid \omega) = \sum_{q_1^T} P(q_1)\, P(y_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(y_t \mid q_t). \qquad (1)$$
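As an illustration of (1), the sum over hidden sequences can be evaluated directly for small $T$ and checked against the standard forward recursion. The sketch below is not taken from the paper: it reads the first factor of (1) as the initial-state probability $P(q_1)$ and the transition factor as $P(q_t \mid q_{t-1})$, and the function names and toy parameter values are assumptions chosen only for the example.

```python
from itertools import product

def sequence_likelihood(pi, A, B, y):
    """Brute-force P(y | pi, A, B): sum over every hidden path q_1..q_T."""
    n = len(pi)
    total = 0.0
    for q in product(range(n), repeat=len(y)):
        p = pi[q[0]] * B[q[0]][y[0]]                 # P(q_1) P(y_1 | q_1)
        for t in range(1, len(y)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][y[t]]   # P(q_t | q_{t-1}) P(y_t | q_t)
        total += p
    return total

def forward_likelihood(pi, A, B, y):
    """Same quantity via the forward recursion, in O(n^2 T) operations."""
    n = len(pi)
    alpha = [pi[i] * B[i][y[0]] for i in range(n)]
    for obs in y[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical toy parameters: n = 2 hidden states, m = 3 symbols, T = 2.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
y = [0, 2]  # observed symbols, encoded as indices into O

print(sequence_likelihood(pi, A, B, y), forward_likelihood(pi, A, B, y))
```

The brute-force version costs $n^T$ terms, which is the exponential scaling in $T$ that matters whenever an algorithm sums explicitly over hidden sequences; the forward recursion gives the same number at polynomial cost.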
In the learning process, the HMM is fed with a series and adapts its parameters to produce similar ones. Data feeding can range from offline (all data are fed and the parameters are calculated all at once) to online (data are fed in parts and partial calculations are made).
We study a scenario with data generated by an HMM of unknown parameters, an extension of the student-teacher scenario from neural networks. The performance, as a function of the number of observations, is given by how far the student is from the teacher, as measured by a suitable criterion. Here we use the naturally arising Kullback-Leibler (KL) divergence, which, although not accessible in practice since it requires knowledge of the teacher, is a very informative extension of the idea of generalisation error.
We propose three algorithms and compare them with the Baldi-Chauvin Algorithm (BC) [6]: the Baum-Welch Online Algorithm (BWO), an adaptation of the offline Baum-Welch Reestimation Formulas (BW) [1]; and, starting from a Bayesian formulation, an approximation named the Bayesian Online Algorithm (BOnA), which can be simplified further, without noticeable loss of performance, to a Mean Posterior Approximation (MPA). BOnA and MPA, inspired by Amari [7] and Opper [8], are essentially mean field methods [9] in which a manifold of tractable prior distributions is introduced and the new datum leads, through Bayes' theorem, to a non-tractable posterior. The key step is to take as the new prior not the posterior but the closest distribution (in some sense) in the manifold.
The paper is organised as follows: first, BWO is introduced and analysed. Next, we derive BOnA for HMMs and, from it, MPA. We compare MPA and BC for drifting concepts. Then, we discuss learning and symmetry breaking and end with our conclusions.
BAUM-WELCH ONLINE ALGORITHM
The Baum-Welch Online Algorithm (BWO) is an online adaptation of BW where, in each iteration of BW, $y$ becomes $y^p$, the $p$-th observed sequence. Multiplying the BW increment by a learning rate $\eta_{BW}$ we get the update equations for $\omega$

$$\hat{\omega}^{p+1} = \hat{\omega}^{p} + \eta_{BW}\, \hat{\Delta}\omega^{p}, \qquad (2)$$

with $\hat{\Delta}\omega^{p}$ the BW variations for $y^p$. The complexity of BWO is polynomial in $n$ and $T$.
In figure 1, the HMM learns sequences generated by a teacher with $n = 2$, $m = 3$ and $T = 2$ for different $\eta_{BW}$. Initial students have matrices with all entries set to the same value, which we call a symmetric initial student. We took averages over 500 random teachers, and distances are given by the KL-divergence between two HMMs $\omega_1$ and $\omega_2$

$$d_{KL}(\omega_1, \omega_2) \equiv \sum_{y_1^T} P(y_1^T \mid \omega_1) \ln\left[\frac{P(y_1^T \mid \omega_1)}{P(y_1^T \mid \omega_2)}\right]. \qquad (3)$$
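For short sequences the sum in (3) can be carried out by enumerating all $m^T$ observed sequences. The following is a minimal sketch under that assumption, not the authors' code; the function names are hypothetical, the likelihood is computed with the standard forward recursion, and the teacher used in the example is the one given in (15).

```python
from itertools import product
from math import log

def likelihood(hmm, y):
    """P(y | omega) for omega = (pi, A, B), using the forward recursion."""
    pi, A, B = hmm
    n = len(pi)
    alpha = [pi[i] * B[i][y[0]] for i in range(n)]
    for obs in y[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                 for j in range(n)]
    return sum(alpha)

def kl_divergence(hmm1, hmm2, T):
    """d_KL(omega_1, omega_2) summed over all m**T sequences of length T."""
    m = len(hmm1[2][0])
    d = 0.0
    for y in product(range(m), repeat=T):
        p1 = likelihood(hmm1, y)
        if p1 == 0.0:
            continue                  # terms with P(y | omega_1) = 0 vanish
        p2 = likelihood(hmm2, y)
        if p2 == 0.0:
            return float("inf")       # omega_2 excludes a sequence omega_1 emits
        d += p1 * log(p1 / p2)
    return d

# Teacher of equation (15) and a symmetric student (all entries equal).
teacher = ([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
student = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[1 / 3] * 3, [1 / 3] * 3])

print(kl_divergence(teacher, student, T=2))
```

With $T = 2$ this teacher emits a single sequence with probability one, to which the symmetric student assigns probability $1/9$, so the initial distance in this example is $\ln 9 \approx 2.197$.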
We see that after a certain number of sequences the HMM stops learning, which is
particular to the symmetric initial student and disappears for a non-symmetric one.
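Denoting the variation of the parameters in BC by $\Delta$, in BW by $\hat{\Delta}$, in BWO by $\tilde{\Delta}$, and with $\gamma_t(i) \equiv P(q_t = s_i \mid y^p, \omega^p)$, we have to first order in $\lambda$

$$\Delta\pi_i = \frac{\lambda \eta_{BC}}{n}\, \hat{\Delta}\pi_i = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}}\, \tilde{\Delta}\pi_i, \qquad (4)$$

$$\Delta A_{ij} = \frac{\lambda \eta_{BC}}{n} \left[\sum_{t=1}^{T-1} \gamma_t(i)\right] \hat{\Delta}A_{ij} = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}} \left[\sum_{t=1}^{T-1} \gamma_t(i)\right] \tilde{\Delta}A_{ij},$$

$$\Delta B_{i\alpha} = \frac{\lambda \eta_{BC}}{n} \left[\sum_{t=1}^{T} \gamma_t(i)\right] \hat{\Delta}B_{i\alpha} = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}} \left[\sum_{t=1}^{T} \gamma_t(i)\right] \tilde{\Delta}B_{i\alpha}.$$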
FIGURE 1. Log-log curves of BWO for three different $\eta_{BW}$ indicated next to the curves. (Axes: $d$ (Kullback-Leibler) versus $p$.)
For $\eta_{BW} \approx \lambda \eta_{BC}/n$ and small $\lambda$, variations in BC are proportional to those in BWO, but with different effective learning rates for each matrix depending on $y^p$. Simulations show that the actual values are of the same order as the approximated ones.
THE BAYESIAN ONLINE ALGORITHM
The Bayesian Online Algorithm (BOnA) [8] uses Bayesian inference to adjust $\omega$ in the HMM using a data set $D_P = \{y^1, \dots, y^P\}$. For each datum, the prior distribution is updated by Bayes' theorem. This update takes a prior from a parametric family and transforms it into a posterior which, in general, no longer has the same parametric form. The strategy used by BOnA is then to project the posterior back into the initial parametric family. To achieve this, we minimise the KL-divergence between the posterior and a distribution in the parametric family. This minimisation yields the parameters of the closest parametric distribution, by which we approximate the posterior. The parameters of the student HMM $\omega$ at each step of the learning process are estimated as the means of each projected distribution.
For a parametric family of the form $P(x) \propto e^{-\sum_i \lambda_i f_i(x)}$, which can be obtained from the MaxEnt principle by constraining the averages over $P(x)$ of arbitrary functions $f_i(x)$, minimising the KL-divergence turns out to be equivalent to equating the averages $\langle f_i(x) \rangle$ over $P(x)$ to the averages of these functions over the unprojected posterior (the posterior distribution just after the Bayesian update for the next datum).
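The equivalence stated above follows in one line. The sketch below assumes that the divergence is taken from the unprojected posterior $Q$ to the family member $P$, that is $d_{KL}(Q \| P)$, and it introduces the normalisation $Z(\lambda)$, a symbol not used in the text.

```latex
\begin{aligned}
P(x) &= \frac{1}{Z(\lambda)}\, e^{-\sum_i \lambda_i f_i(x)}, \qquad
Z(\lambda) = \int dx\, e^{-\sum_i \lambda_i f_i(x)}, \\
d_{KL}(Q \| P) &= \int dx\, Q(x) \ln Q(x) + \sum_i \lambda_i \langle f_i \rangle_Q + \ln Z(\lambda), \\
\frac{\partial d_{KL}}{\partial \lambda_i} &= \langle f_i \rangle_Q + \frac{\partial \ln Z}{\partial \lambda_i}
= \langle f_i \rangle_Q - \langle f_i \rangle_P = 0 .
\end{aligned}
```

The last step uses $\partial \ln Z / \partial \lambda_i = -\langle f_i \rangle_P$, so the stationary point is exactly the moment-matching condition $\langle f_i \rangle_P = \langle f_i \rangle_Q$.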
For HMMs, the vector $\pi$ and each $i$-th row $A_i$ of $A$ and $B_i$ of $B$ are different discrete distributions, which we assume independent in order to write the factorized distribution

$$P(\omega \mid u) \equiv P(\pi \mid \rho) \prod_{i=1}^{n} P(A_i \mid a_i)\, P(B_i \mid b_i), \qquad (5)$$

where $u = (\rho, a, b)$ represents the parameters of the distributions.
As each factor is a distribution over probabilities, the natural choice is the Dirichlet distribution, which for an $N$-dimensional variable $x$ is

$$D(x \mid u) = \frac{\Gamma(u_0)}{\prod_{i=1}^{N} \Gamma(u_i)} \prod_{i=1}^{N} x_i^{u_i - 1}, \qquad (6)$$
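where $u_0 = \sum_i u_i$ and $\Gamma$ is the analytical continuation of the factorial to real numbers. These can be obtained from MaxEnt with $f_i(x) = \ln x_i$ [13]:

$$\int d\mu\, D(x) \ln x_i = \alpha_i, \qquad d\mu \equiv \delta\left(\sum_i x_i - 1\right) \prod_i \theta(x_i)\, dx_i. \qquad (7)$$

The function to be extremized is

$$\mathcal{L} = \int d\mu\, D \ln D + \lambda \left(\int d\mu\, D - 1\right) + \sum_i \lambda_i \left(\int d\mu\, D \ln x_i - \alpha_i\right), \qquad (8)$$

and with $\delta\mathcal{L}/\delta D = 0$ we get the Dirichlet with normalisation $e^{\lambda + 1}$ and $u_i = 1 - \lambda_i$.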
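Each factor distribution is separately projected by equating the averages of the logarithms in the original posterior $Q$ and in the projected distributions

$$\psi(\rho_i) - \psi\left(\sum_j \rho_j\right) = \langle \ln \pi_i \rangle_Q \equiv \mu_i(\rho), \qquad (9)$$

$$\psi(a_{ij}) - \psi\left(\sum_k a_{ik}\right) = \langle \ln A_{ij} \rangle_Q \equiv \mu_{ij}(a),$$

$$\psi(b_{i\alpha}) - \psi\left(\sum_\beta b_{i\beta}\right) = \langle \ln B_{i\alpha} \rangle_Q \equiv \mu_{i\alpha}(b),$$

where $\psi(x) = d \ln \Gamma(x)/dx$ is the digamma function. We call a set of $N$ equations

$$\psi(x_i) - \psi\left(\sum_j x_j\right) = \mu_i, \qquad (10)$$

with $i = 1, \dots, N$ a digamma system in the variables $x_i$ with coefficients $\mu_i$.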
Let us call $P^p(\omega)$ the projected distribution after observation of $y^p$, and $Q^{p+1}(\omega)$ the posterior distribution (not yet projected) after $y^{p+1}$. By Bayes' theorem,

$$Q^{p+1}(\omega) \propto P^p(\omega) \sum_{q^{p+1}} P(y^{p+1}, q^{p+1} \mid \omega). \qquad (11)$$
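The calculation of the $\mu$'s in (9) leads to averages over Dirichlets of the form [10]

$$\mu_i = \left\langle \left[\prod_j x_j^{r_j}\right] \ln x_i \right\rangle = \frac{\Gamma(u_0)}{\prod_j \Gamma(u_j)}\, \frac{\prod_j \Gamma(u_j + r_j)}{\Gamma(u_0 + r_0)}\, \left[\psi(u_i + r_i) - \psi(u_0 + r_0)\right]. \qquad (12)$$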
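The closed form (12) is straightforward to evaluate numerically. The sketch below is only an illustration under stated assumptions: it takes $r_0 = \sum_j r_j$ (by analogy with $u_0$), works with log-gamma values to avoid overflow, and uses a hand-written digamma approximation (recurrence plus asymptotic series) so that it depends on the Python standard library alone. The function names are hypothetical.

```python
from math import exp, lgamma, log

def digamma(x):
    """psi(x) for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def dirichlet_log_average(u, r, i):
    """< prod_j x_j**r_j * ln x_i > under a Dirichlet with parameters u, as in (12)."""
    u0, r0 = sum(u), sum(r)        # assumption: r_0 is the sum of the r_j
    log_ratio = (lgamma(u0) - sum(lgamma(uj) for uj in u)
                 + sum(lgamma(uj + rj) for uj, rj in zip(u, r))
                 - lgamma(u0 + r0))
    return exp(log_ratio) * (digamma(u[i] + r[i]) - digamma(u0 + r0))

# Check on a case with a known integral: for u = (1, 1) the variable x_1 is
# uniform on [0, 1], and the integral of x ln x over [0, 1] equals -1/4.
print(dirichlet_log_average([1.0, 1.0], [1.0, 0.0], 0))
```

For $r = 0$ the expression reduces to $\psi(u_i) - \psi(u_0)$, the left-hand side of the digamma system (10).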
To solve (10), we solve for $x_i$, sum over $i$ with $x_0 \equiv \sum_i x_i$ and find numerically, by iterating from an arbitrary initial point, the fixed points of the one-dimensional map

$$x_0^{n+1} = \sum_i \psi^{-1}\left[\mu_i + \psi(x_0^n)\right], \qquad (13)$$
FIGURE 2. Comparison in log-log scale of MPA (dashed line) and BOnA (circles). (Axes: $d$ (Kullback-Leibler) versus $p$.)
where we found a unique solution except for µi ≈ 0, which is rare in most applications.
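The fixed-point iteration (13) can be sketched as follows. This is an illustrative implementation, not the authors' code: the digamma and trigamma functions are approximated by a recurrence plus asymptotic series, $\psi^{-1}$ is obtained by Newton's method from a commonly used initial guess, and all function names, tolerances and the starting point $x_0 = 1$ are assumptions.

```python
from math import exp, log

def digamma(x):
    """psi(x) for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv * (1.0 + 0.5 * inv + inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42)))

def inverse_digamma(y, steps=50):
    """Solve psi(x) = y for x > 0 by Newton's method."""
    x = exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(steps):
        x_new = x - (digamma(x) - y) / trigamma(x)
        if x_new <= 0.0:
            x_new = 0.5 * x        # safeguard: stay in the domain x > 0
        if abs(x_new - x) < 1e-14 * max(1.0, x):
            return x_new
        x = x_new
    return x

def solve_digamma_system(mu, x0=1.0, tol=1e-10, max_iter=10000):
    """Iterate the one-dimensional map (13), then recover each x_i."""
    for _ in range(max_iter):
        psi0 = digamma(x0)
        x0_new = sum(inverse_digamma(m + psi0) for m in mu)
        converged = abs(x0_new - x0) < tol
        x0 = x0_new
        if converged:
            break
    psi0 = digamma(x0)
    return [inverse_digamma(m + psi0) for m in mu]

# Round trip: build the coefficients from known parameters, then recover them.
u = [2.0, 3.0, 5.0]
mu = [digamma(ui) - digamma(sum(u)) for ui in u]
print(solve_digamma_system(mu))
```

In this round-trip example the iteration converges linearly and returns values close to the original parameters; the number of iterations needed depends on the coefficients $\mu_i$ and on the starting point.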
BOnA has a common problem of Bayesian algorithms: the sum over hidden variables makes the complexity scale exponentially in $T$. Also, the calculation of several digamma functions is very time consuming. In the following, we develop an approximation that runs faster, although still with exponential complexity in $T$. This is not a problem, since we can keep $T$ constant and the algorithm then scales polynomially in $n$.
MEAN POSTERIOR APPROXIMATION
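The Mean Posterior Approximation (MPA) is a simplification of BOnA inspired by its results for Gaussians, where we match the first and second moments of the posterior and projected distributions. Noting this, instead of minimising $d_{KL}$ we match the mean and one of the variances of the posterior and projected distributions as an approximation, which gives, with hatted variables for reestimated values [10],

$$\hat{\rho}_i = \langle \pi_i \rangle_Q\, \frac{\langle \pi_1 \rangle_Q - \langle \pi_1^2 \rangle_Q}{\langle \pi_1^2 \rangle_Q - \langle \pi_1 \rangle_Q^2}, \qquad (14)$$

$$\hat{a}_{ij} = \langle A_{ij} \rangle_Q\, \frac{\langle A_{i1} \rangle_Q - \langle A_{i1}^2 \rangle_Q}{\langle A_{i1}^2 \rangle_Q - \langle A_{i1} \rangle_Q^2},$$

$$\hat{b}_{i\alpha} = \langle B_{i\alpha} \rangle_Q\, \frac{\langle B_{i1} \rangle_Q - \langle B_{i1}^2 \rangle_Q}{\langle B_{i1}^2 \rangle_Q - \langle B_{i1} \rangle_Q^2},$$

with complexity again of order $n^T$, but with heavily reduced real computational time, making it better for practical applications.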
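The structure of (14) can be checked on an exact Dirichlet, for which the ratio of moments equals the sum of the parameters. The sketch below is a hypothetical helper, not the authors' implementation: it takes the means of all components and the second moment of the first component as given numbers, whereas in MPA these would be averages over the unprojected posterior $Q$.

```python
def match_dirichlet_moments(means, second_moment_first):
    """Dirichlet parameters from all means and the second moment of component 1.

    means[i] plays the role of <x_i>_Q and second_moment_first that of <x_1^2>_Q.
    """
    m1 = means[0]
    # For an exact Dirichlet this ratio equals u_0, the sum of the parameters.
    scale = (m1 - second_moment_first) / (second_moment_first - m1 * m1)
    return [m * scale for m in means]

# Round trip on a Dirichlet with known parameters (illustrative values).
u = [2.0, 3.0, 5.0]
u0 = sum(u)
means = [ui / u0 for ui in u]                       # <x_i> = u_i / u_0
second = u[0] * (u[0] + 1.0) / (u0 * (u0 + 1.0))    # <x_1^2>
print(match_dirichlet_moments(means, second))
```

Only products and one division are needed per distribution, with no digamma evaluations or fixed-point iteration, which is consistent with the reduction in running time reported for MPA.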
Figure 2 compares MPA and BOnA. The initial difference decreases with time and the two curves approach each other relatively fast. We used $n = 2$, $m = 3$ and $T = 2$ and averaged over 150 random teachers with symmetric initial students. The computational time was 340 min for BOnA and 5 s for MPA on a 1 GHz processor. Figure 3a compares MPA to BC and figure 3b to BWO. In both cases MPA has better generalisation. We used $n = 2$, $m = 3$, $T = 2$, symmetric initial students and averaged over 500 random teachers.
FIGURE 3. a) Comparison between MPA (dashed) and BC (continuous). Values of $\lambda$ are indicated next to the curves. $\eta_{BC} = 0.5$. b) Comparison between MPA (dashed) and BWO (continuous). Values of $\eta_{BW}$ are indicated next to the curves. Both scales are log-log. (Axes: $d$ (Kullback-Leibler) versus $p$.)
FIGURE 4. Drifting concepts. Continuous lines correspond to MPA and dashed lines to BC. a) Abrupt changes at 500 sequences interval. b) Small random changes at each new sequence. (Axes: $d$ (Kullback-Leibler) versus $p$.)
LEARNING DRIFTING CONCEPTS
We tested BC and MPA for changing teachers. In figure 4a, the teacher changes at random after every 500 sequences ($\lambda = 0.01$, $\eta_{BC} = 10.0$). In figure 4b, each time a sequence is observed, a small random quantity is added to the teacher. Both have $n = 2$, $m = 3$ and are averaged over 200 runs.
Figure 4b shows that BC adapts better, but it is not fully adaptive and we do not know how to modify it. MPA instead derives from Bayesian principles, and we can guess the problem by analogy with similar Bayesian algorithms [12]: variances decrease during the process, as in the perceptron, where they play the role of learning rates, which explains the memory effect that hinders learning after changes. Although not proved yet, we expect the same relationship in MPA, which can be used to improve performance.
LEARNING AND SYMMETRY BREAKING
Learning from symmetric initial students requires that the parameters separate from each other at some point, which depends on the algorithm and is an important feature in online
FIGURE 5. KL-divergence and student's parameters for a) BC and b) MPA. (Panels in each case: $d$ (Kullback-Leibler), initial probabilities, transition matrix and emission matrix, each plotted against $p$.)
algorithms [11], breaking the symmetry with a sharp decrease in the generalisation error. Instead of taking averages, which would smooth abrupt changes, here we draw curves for only one teacher, so that the changes remain visible. Flat lines before a symmetry breaking are called plateaux and occur when it is difficult to break the symmetry.
Figure 5a shows BC ($\lambda = 0.01$, $\eta_{BC} = 1.0$) with two abrupt changes: at the beginning and after 1000 sequences. $\pi$ and $A$ only break the symmetry at the second point, and $B$ at both. Figure 5b shows that in MPA the second change is stronger and the symmetry breaking affects both $B$ and $A$. Figure 6 shows BWO with $\eta_{BW} = 0.01$, where only $B$ is affected. The more symmetries are broken, the better the generalisation of the algorithm.
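In all simulations we set $n = 2$, $m = 3$ and $T = 2$ with a teacher HMM given by

$$\pi = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (15)$$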
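To make the teacher in (15) concrete, the sketch below draws sequences from an HMM given as plain Python lists. The function name and the encoding of states and symbols as zero-based indices are assumptions made for the example; the sampling itself is the usual ancestral procedure (initial state, emission, then alternating transitions and emissions).

```python
import random

def sample_sequence(pi, A, B, T, rng=random):
    """Draw one observed sequence of length T from the HMM (pi, A, B)."""
    states = range(len(pi))
    symbols = range(len(B[0]))
    q = rng.choices(states, weights=pi)[0]             # q_1 ~ pi
    y = [rng.choices(symbols, weights=B[q])[0]]        # y_1 ~ B[q_1]
    for _ in range(T - 1):
        q = rng.choices(states, weights=A[q])[0]       # q_{t+1} ~ A[q_t]
        y.append(rng.choices(symbols, weights=B[q])[0])
    return y

# Teacher of equation (15); states and symbols are zero-based indices here.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(sample_sequence(pi, A, B, T=2))  # always [0, 2] for this teacher
```

Because every row of this teacher is a point mass, the generated sequence is deterministic: the chain starts in $s_1$, emits $o_1$, moves to $s_2$ and emits $o_3$.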
CONCLUSIONS
We proposed and analysed three learning algorithms for HMMs: Baum-Welch Online (BWO), Bayesian Online Algorithm (BOnA) and Mean Posterior Approximation (MPA). We showed the superior performance of MPA for static teachers, while the Baldi-Chauvin (BC) algorithm is better for drifting concepts, although the Bayesian nature of MPA suggests how to fix this. The results seem to be confirmed by initial tests on real data.
The importance of symmetry breaking in learning processes is presented here in a brief discussion, where the phenomenon is shown to occur in our models.
ACKNOWLEDGEMENTS
We would like to thank Evaldo Oliveira, Manfred Opper and Lehel Csato for useful discussions. This work was carried out partly at the University of São Paulo with financial support from FAPESP and partly at Aston University with support from the Evergrow Project.


