Sign up & Download
Sign in

Bayes Consistent Classification of EEG Data by Approximate Marginalization

by Peter Sykacek, Iead Rezek, Stephen J Roberts
Engineering (2005)

Abstract

This chapter proposes a generative model and a Bayesian learning scheme for a classifier that takes uncertainty at all levels of inference into account. Classifier inputs will be uncertain if they are estimated, for example using a preprocessing method. Classical approaches would neglect the uncertainties associated with these variables. However, the decisions thus found are not consistent with Bayesian theory. In order to conform with the axioms underlying the Bayesian framework, we must incorporate all uncertainties into the decisions. The model we use for classification treats input variables resulting from preprocessing as latent variables. The proposed algorithms for learning and prediction fuse information from different sensors spatially and across time according to its certainty. In order to get a computationally tractable method, both feature and model uncertainty of the preprocessing stage are obtained in closed form. Classification requires integration over this latent feature space, which in this chapter is done approximately by a Markov chain Monte Carlo (MCMC) method. Our approach is applied to classifying cognitive states from EEG segments and to classification of sleep spindles

Cite this document (BETA)

Page 1
hidden

Bayes Consistent Classification of EEG Data by Approximate Marginalization

1Bayes Consistent Classification of EEG Data by
Approximate Marginalisation
P. Sykacek, I. Rezek and S. Roberts
University of Oxford, Dept. of Engineering Science, Oxford, U.K.
fpsyk,irezek,sjrobg@robots.ox.ac.uk
Summary. This chapter proposes a generative model and a Bayesian learning scheme for a
classifier that takes uncertainty at all levels of inference into account. Classifier inputs will
be uncertain if they are estimated, for example using a preprocessing method. Classical ap-
proaches would neglect the uncertainties associated with these variables. However, the deci-
sions thus found are not consistent with Bayesian theory. In order to conform with the axioms
underlying the Bayesian framework, we must incorporate all uncertainties into the decisions.
The model we use for classification treats input variables resulting from preprocessing as latent
variables. The proposed algorithms for learning and prediction fuse information from different
sensors spatially and across time according to its certainty. In order to get a computationally
tractable method, both feature and model uncertainty of the preprocessing stage are obtained
in closed form. Classification requires integration over this latent feature space, which in this
chapter is done approximately by a Markov chain Monte Carlo (MCMC) method. Our ap-
proach is applied to classifying cognitive states from EEG segments and to classification of
sleep spindles.
1.1 Introduction
Once we opt for decision analysis within a Bayesian framework, we must do this
throughout the entire process. The approach studied in this chapter assumes that we
are given some segments of time series, each labelled as being in one of K possible
states. Such a setting is typically found in biomedical diagnosis, where successive
segments of bio signals have to be classified. Usually, the number of samples within
the segments is large. However, the information contained in a segment is often rep-
resented by a much smaller number of features. Such problems are typically solved
by splitting the whole problem into two parts: a preprocessing method that extracts
some features and a classifier.
Using Bayesian inference for each part is common practice. For preprocessing it
has been considered among many others by [1] and [2]. A review of recent devel-
opments is provided by [3]. The situation in post-processing is similar: see e.g. [4],
[5] or [6] for a summary. We could follow these lines and apply Bayesian techniques
Page 2
hidden
2 P. Sykacek, I. Rezek and S. Roberts
to both stages of the decision process separately. However this would mean estab-
lishing a non probabilistic link between preprocessing and classification. Such a link
does not allow feature or model uncertainty, as found by Bayesian preprocessing, to
have an influence on the beliefs reported by the classifier. In a Bayesian sense this
is equivalent to approximating the a-posteriori probability density over feature vari-
ables by a delta peak and ignoring the uncertainty in the stochastic model used for
preprocessing.
ρ
X
t
Fig. 1.1. A directed acyclic graph for the hierarchical model. We used t to denote the unknown
state variable of interest, ½ is a latent (unobserved) variable representing features from prepro-
cessing and X denotes a segment of a time series. Circles indicate latent variables, whereas
the square indicates that X is an observed quantity.
A model that allows for a probabilistic link between preprocessing and post-
processing has to have a structure similar to the directed acyclic graph (DAG) shown
in Figure 1.1. The key idea is to treat the features obtained by preprocessing as la-
tent variables. The link between ½ and X represents preprocessing, which has to be
carried out in a Bayesian setting. The model used in this chapter for preprocessing
is a lattice filter representation of an auto regressive (AR) model. The coefficients of
this model are the well known reflection coefficients as commonly used in speech
processing [7]. We give a short summary of our derivation of “Bayesian reflection
coefficients” in section 1.2.
So far approaches using a probabilistic structure, similar to the DAG in Figure
1.1, have focused on marginalising out input uncertainties. The main emphasis in
[8] and [9] is to derive a predictive distribution for regression models where input
uncertainty is taken into account. In a Bayesian sense this approach is the only way
of consistent reasoning if perfect knowledge of inputs is not available. We provide a
similar analysis for classification problems. However our approach goes further:
Page 3
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 3
² We allow for different uncertainties at different inputs. This leads to predictions
that are dominated by inputs that are more certain. Using generative models to
link t and ½, the model provides information about the true input values of less
certain sensors. Using several sensors will lead to “Bayesian sensor fusion”. The
use of a generative model gives us the additional benefit that we can use unla-
belled data for parameter inference as well.
² Another extension of the work reported in [8] and [9] is that we take the model
uncertainty, inherent to all preprocessing methods, into account. Model uncer-
tainty is represented by the a-posteriori probability of the model conditional on
the corresponding data segment, X .
In the following sections of the chapter, we derive a Bayesian solution for lattice
filter AR-models that captures both feature and model uncertainty. We show that
marginalising out latent variables in a static DAG indeed performs sensor fusion
such that predictions depend more on certain information. Equipped with this theo-
retical insight, we propose a DAG and inference scheme that is well suited for time
series classification. We assume a first order Markov dependency among class la-
bels of interest and derive MCMC updates that sample from the joint probability
distribution over latent variables and model coefficients. The experimental section of
this chapter provides a synthetic experiment to illustrate the idea and provides then
a quantitative evaluation using the proposed method for sleep spindle classification
and as a classifier in a brain computer interface experiment.
1.2 Bayesian lattice filter
The lattice filter is a representation of an auto regressive (AR) model [7]. Its param-
eters are the so called refection coefficients, below denoted as rm. Equation (1.1)
shows an AR model, where x[t] is the sample of a time series at time t, a¹ is the ¹-th
AR coefficient of the m-th order AR model and e[t] the forward prediction error,
which we assume to be independently identically distributed (i.i.d.) Gaussian with
zero mean and precision (inverse variance) ¯.
x[t] = ¡
mX
¹=1
x[t¡ ¹]a¹ + e[t] (1.1)
The autocovariance function, which is the sufficient statistic of a stationary Gaus-
sian noise AR model, is invariant to the direction of time [7]. The corresponding
backward prediction model shares thus the same parameters.
x[t¡m] = ¡
mX
¹=1
x[t+ 1¡ ¹]a¹ + b[t] (1.2)
To derive Bayesian reflection coefficients, we assume that N data points, X =
fx[1]; ::; x[N ]g, are available to estimate the model. We denote the m-th order
Page 4
hidden
4 P. Sykacek, I. Rezek and S. Roberts
AR coefficients as 'm, summarise the forward prediction errors e[t] as ²m and all
bacward prediction errors b[t] as bm. Applying the backward time shift operator q¡1
on bm, linear regression onto the forward prediction errors, ²m, defines the likeli-
hood of the m-th order AR coefficients and the m+1-th reflection coefficient rm+1.
p(Xjrm+1; ¯;'m) = (2¼)¡0:5N¯0:5N (1.3)
£ exp
³
¡0:5 ¡²m + rm+1q¡1bm
¢T ¡²m + rm+1q¡1bm
¢´
Since we use this model to represent segments of biological time series, we know
with certainty that the underlying AR-process must be stable. As the reflection coef-
ficients of a stable model have to be within the interval [¡1; 1], we may use a flat prior
within this range. Thus the uninformative proper prior over reflection coefficients is:
p(rm+1) = 0:5. Another parameter which appears in the likelihood expression of
the lattice filter model is the noise level ¯. The noise level is a scale parameter and
following [10] we use the Jeffreys’ prior, p(¯) = 1=¯.
In order to obtain the posterior distribution over the m+1-th reflection coefficient
rm+1, the Bayesian paradigm requires us to treat both the m-th order AR coefficients,
'm, and the noise level, ¯, as nuisance parameters and integrate them out. We may
simplify calculations considerably by assuming a sharply peaked posterior over 'm.
This results in an order recursive estimation, where we condition on the forward and
backward prediction errors that result from the most probable coefficients.
p(rm+1jX ) = 1p2¼s exp
µ
¡ 12s2 (rm+1 ¡ r^m+1)
2

(1.4)
as the posterior distribution of the m+ 1-th order reflection coefficient, in which
r^m+1 = ¡²
T
mq¡1bm
bTmbm
(1.5)
represents the most probable value of the reflection coefficient and,
s2 = 1¡ (r^m+1)
2
(N ¡ 1) ; (1.6)
the corresponding variance. We finally need to update the forward and backward
prediction errors, to account for the m+ 1-st reflection coefficient.
²m+1 = ²m + r^m+1q¡1rm (1.7)
rm+1 = q¡1rm + r^m+1²m
Our analysis will also consider the uncertainty about the lattice filter model, by al-
lowing for two explanations of the dataX . We assume that the data is either modelled
by an M -th order lattice filter or that it is white noise. This uncertainty is captured
by the posterior probability P (IM jX ), where IM ´ 1 denotes the lattice filter ex-
planation and IM ´ 0 denotes the white noise case. The Bayesian model evidence
(marginal likelihood) of the M -th order lattice filter model, IM ´ 1, is then
Page 5
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 5
p(XjIM ´ 1) = 0:5¼¡N2 ¡ (N2 )
p
2¼s (1.8)
£ ¡(²M¡1 + r^MbM¡1)T(²M¡1 + r^MbM¡1)
¢¡N2 :
Comparing the M -stage lattice filter model with the Bayesian evidence of explaining
the data as white noise gives a measure of model uncertainty if we allow for these two
explanations only. We obtain the evidence of a white noise explanation by integrating
out the noise level ¯. We indicate the corresponding model by I0 and get:
p(XjIM ´ 0) = ¼¡N2 ¡ (N2 )
¡²T0 ²0
¢¡N2 ; (1.9)
where ²0 refers to the 0-order model residuals (the original time series). Using equal
priors for both models, the a-posteriori probability of the m-th reflection coefficient
compared to a white noise explanation is:
P (IM ´ 1jX ) = p(XjIM ´ 1)p(XjIM ´ 1) + p(XjIM ´ 0) : (1.10)
We use P (IM jX ) to measure the uncertainty of modelling the data by an M -th order
lattice filter, when we allow for a white noise explanation as the other possibility. Our
motivation for this two-model hypothesis is our intent to model electroencephalog-
raphy (EEG) data. EEG is often contaminated by muscle artefacts, which result in
measurements that are very similar to white noise.
Equations (1.4), (1.5) and (1.6) imply that the underlying processes are dynami-
cally stable systems since they do not allow reflection coefficients to be outside the
interval [¡1; 1]. Since we intend to model the distributions in the latent feature space
by a mixture of Gaussian distribution, we have to resolve a mismatch between the
domain of the posterior over M -th order lattice filter coefficients, r 2 [¡1; 1]M and
the domain of the Gaussian mixture model, which requires ½ 2 RM . To resolve the
problem, we follow [11] and map using Fisher’s z-transform
½ = arctanh(r); (1.11)
which we use henceforth as representation of the feature space. Using error analysis
[12], we can approximate the posterior distribution over the transformed reflection
coefficients ½m by
p(½mjX ) = 1p2¼
p
¸ exp
µ
¡12¸(½m ¡ ½^m)
2

(1.12)
where ½^m = arctanh(r^m)
and ¸ = (N ¡ 1)(1¡ r^2m):
1.3 Spatial fusion
In our case the input uncertainty is a result of the limited accuracy of feature esti-
mates as obtained by preprocessing. If we know that the model used in preprocessing
Page 6
hidden
6 P. Sykacek, I. Rezek and S. Roberts
ρ ρ
a
I
Xb
a ba b
t
X
I
Fig. 1.2. A directed acyclic graph that captures both parameter uncertainty as well as model
uncertainty in preprocessing. We used t to denote the unknown state variable of interest, ½a
and ½b are two latent variables representing the true values of features estimated by prepro-
cessing. Both Xa and Xb are the corresponding segments of a time series. The latent binary
indicator variables Ia and Ib control the dependency of the data segments on the latent vari-
ables.
is the true model that generated the time series X , the approach taken by [8] and [9]
would be sufficient. As however was argued in section 1.2, we do not know whether
the model used during preprocessing is the true one and we have to consider feature
as well as model uncertainty. Thus inference has to be carried out with a DAG that
allows for both feature and model uncertainty. Sensor fusion will only take place
if different sensors are conditionally dependent on the state of interest. In order to
take this into account, we have to extend the DAG structure as shown in Figure 1.2.
We introduce binary indicator variables Ia and Ib that control the conditional depen-
dency of the observed data segments Xa and Xb on the latent variables ½a and ½b
respectively.
The DAG in Figure 1.2 illustrates the dependency between a state variable t,
latent variables ½a and ½b and the corresponding segments of a time series Xa and
Xb. The dependency is controlled by two model indicator variables, Ia and Ib, one
for each sensor. Both models, Ia and Ib, representing particular stages of a lattice
filter, are probable explanations of the corresponding time series Xa and Xb.
During preprocessing we allow for two possible explanations of each segment
of the time series. With probability P (IajXa), the latent variable ½a is conditional
on both the time series Xa and the state variable t. However, with probability 1 ¡
P (IajXa), Xa is pure white noise and does not require ½a. In this case, we have to
condition on t only. Loosely speaking, we deal with a problem of “probably missing
Page 7
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 7
values”. However this interpretation of Figure 1.2 tells us that we need to use the
following definition of the conditional probability density of Xa
p(Xaj½a; Ia) =
(
p(Xaj½a; Ia ´ 1)
p(XajIa ´ 0)
: (1.13)
Depending on the value of Ia, we introduce a conditional independence between the
data, Xa and the true feature value ½a which is not seen in the DAG in Figure 1.2.
When changing indices, we get the same statements for the latent variable ½b.
As previously mentioned, we want to predict the belief of state t conditional on
all available information. This is the observed time series Xa and Xb as well as all
training data D observed so far. Although we will not state this explicitly, all beliefs
are also conditional on training data D. This is a requirement that humans apply
intuitively: Whenever a clinical expert wants to monitor a new recording of some
biological time series they have prior expectations about the range they should use.
In order to provide deeper insight into the model illustrated in Figure 1.2, we
will infer both the a-posteriori probability P (tjXa;Xb) and the a-posteriori prob-
ability density p(½ajXa;Xb). Using Bayes theorem, we express P (t)p(½ajt) as
p(½a)P (tj½a) and P (t)p(½bjt) as p(½b)P (tj½b). Conditioning on Xa and Xb, the
DAG in Figure 1.2 implies:
p(t; ½a; Ia; ½b; IbjXa;Xb) = (1.14)
P (tj½a; Ia)P (tj½b; Ib)p(½ajIa;Xa)p(IajXa)p(½bjIb;Xb)p(IbjXb)p(Xa)p(Xb)
P (t)p(Xa;Xb) :
As none of the variables on the left side of the conditioning bar in (1.14) are observed,
we use marginal inference. We obtain P (tjXa;Xb) by plugging (1.13) into (1.14) and
integrating out ½a, Ia, ½b and Ib
P (tjXa;Xb) = p(Xa)p(Xb)p(Xa;Xb)P (t) (1.15)
£
X
Ia
Z
½a
P (tj½a; Ia)p(½ajIa;Xa)P (IajXa)d½a
£
X
Ib
Z
½b
P (tj½b; Ib)p(½bjIb;Xb)P (IbjXb)d½b:
The naı¨ve Bayes structure of the DAG in Figure 1.2 allows us to expand P (tj½a; Ia) =
P (tj½a)P (tjIa)
P (t) , which equivalently holds for sensor b. The influence of feature un-
certainty on the probability of the state t is most easily seen if we assume perfect
knowledge of the preprocessing model:
P (tjXa;Xb) / 1P (t)
Z
½a
P (tj½a)p(½ajXa)d½a
Z
½b
P (tj½b)p(½bjXb)d½b:
This expression shows that that the probability P (tjXa;Xb) is dominated by Xa, if
½a is, under its posterior, more informative about class t1. At a first glance we might
1 The same argument holds for the other sensor Xb.
Page 8
hidden
8 P. Sykacek, I. Rezek and S. Roberts
expect that higher accuracy should dominate and this seems counter intuitive. How-
ever, perfect knowledge does not help if the extracted feature is equally likely for
different classes. Thus although known with less precision, a variable might domi-
nate if it allows on average for better discrimination. The other extreme is that we
know that Xa and Xb are white noise. In this case we get P (Ia ´ 0jXa) = 1 and
P (Ib ´ 0jXb) = 1 and thus
P (tjXa;Xb) / P (tjIa)P (tjIb)P (t) :
This is a very intuitive result, as without parameters ½a and ½b the model indicators
Ia and Ib are the only information about class t.
If we are interested in the a-posteriori distribution over one of the latent spaces,
say (Ia; ½a), we have to apply similar manipulations. Assuming that the continuous
parameter ½a is a point in RM , (Ia; ½a) lies in Ia ´ 1£ ½a 2 RM
S Ia ´ 0£ ½a 2
R0, where the dimension of ½a in the second case is zero. The posterior is then
p(½a; IajXa;Xb) = p(Xa)p(Xb)p(Xa;Xb) (1.16)
£
X
t
"
P (tj½a; Ia)p(½ajIa;Xa)P (IajXa)
P (t)
£
X
Ib
Z
½a
P (tj½b; Ib)p(½bjIb;Xb)P (IbjXb)d½b
#
:
By exchanging marginalisation over (Ib; ½b) with marginalisation over (Ia; ½a), we
obtain the corresponding a-posteriori density over the feature space (Ib; ½b) .
We should now point out that we did not follow an exact Bayesian scheme, since
this requires us to use all available information. The derivations presented so far de-
viate slightly, since the probabilistic dependency between t and the latent variables
½a and ½b allows, via the conditional distributions p(½ajt) and p(½bjt), for informa-
tion flow between the two sensors. Hence irrespective whether we know t or not, we
will have information about ½a and ½b that goes beyond the “stability prior” adopted
in the last section. However, the problem is that using this information, we can not
derive any expressions analytically as was done there. Instead we have to resort to
MCMC techniques for the entire analysis including preprocessing [11]. Even with
modern computers, this is still a very time consuming procedure. A qualitative argu-
ment shows that this approximation will in practical situations not effect the results
dramatically. The usual settings in preprocessing will lead to a-posteriori densities
over coefficients that, compared with the priors, are sharply peaked around the most
probable value2. In this case using either a uniform prior in the range [¡1; 1], or the
more informative prior provided via t does not make a big difference.
2 As can be seen from Equation (1.6), this is just a matter of the number of samples used to
estimate the feature values.
Page 9
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 9
As a last step it remains to provide an abstract formulation of an inference pro-
cedure of model coefficients for the DAG shown in Figure 1.2. In order to make life
easier, we assume for now that we are given only labelled samples. A generalisation
to include also unlabelled samples is provided in the next section. Assuming to know
the true values of features and states, conventional model inference would condition
on training data, A = f½a;i8ig, B = f½b;i8ig and T = fti8ig. Denoting all model
coefficients jointly by w, inference would lead to p(wjA;B; T ). In our setting the
Ia;i’s, Ib;i’s, ½a;i’s and ½b;i’s are latent variables and we can not condition on them.
Instead we would like to condition on the corresponding Xa;i’s and Xb;i’s. Such con-
ditioning can not be done directly. Assuming independence of observations, the DAG
in Figure 1.2 implies the likelihood to be
p(T ;A; IA;XA;B; IB ;XB jw) = (1.17)
=
Y
i
p(ti; ½a;i; Ia;i;Xa;i; ½b;i; Ib;iXb;ijw)
=
Y
i
"
P (tij½a;i; Ia;i;w)P (tij½b;i; Ib;i;w)
P (tijw)
£p(½a;ijIa;i;Xa;i)p(Ia;ijXa;i)p(½b;ijIb;i;Xb;i)p(Ib;ijXb;i)
#
;
where we use IA = fIa;i8ig, XA = fXa;i8ig, IB = fIb;i8ig and XB = fXb;i8ig
and made the dependency on model parameters w explicit. We finally get the poste-
rior over model coefficients w by multiplication with a prior p(w) and marginalising
out all ½a;i, Ia;i, ½b;i and Ib;i.
p(wjT ;XA;XB) / p(w)
Y
i
"
1
P (ti) (1.18)
£
X
Ia;i
Z
½a;i
P (tij½a;i; Ia;i;w)P (Ia;ijXa;i)p(½a;ijIa;i;Xa;i)d½a;i
£
X
Ib;i
Z
½b;i
P (tij½b;i; Ib;i;w)P (Ib;ijXb;i)p(½b;ijIb;i;Xb;i)d½b;i
#
Expression (1.18) is just proportional to the posterior since we still need to normalize.
In general neither the integrals nor the normalisation constants in (1.15), (1.16) or
(1.18) will be analytically tractable. In all practical problems we will have to apply
MCMC methods.
The model in this section illustrates that marginalising out variables and model
uncertainty leads to predictions that are dominated by information that allows for
reliable discrimination. However the DAG is not optimally suited for classification of
biomedical time series. As our interest is classification of adjacent segments of a time
series, we can improve on the DAG in Figure 1.2 by allowing for additional temporal
dependencies. In the following section we propose a model which allows for a first
Page 10
hidden
10 P. Sykacek, I. Rezek and S. Roberts
order Markov dependency among additional discrete latent variables, di, and thus for
information flow across time. We propose Bayesian inference by sampling from the
joint probability density over latent variables and model coefficients.
1.4 Spatio-temporal fusion
A DAG structure that allows spatio-temporal sensor fusion is obtained by imposing a
conditional dependency among adjacent state variables and to spatially different sen-
sors. Sensor fusion is then achieved very naturally by treating this DAG within the
Bayesian framework. Bayesian preprocessing will assess both model and coefficient
uncertainties. Marginalising out these uncertainties will lead to beliefs about states
that are less affected by less reliable information. Such an approach can go even fur-
ther: it allows one to infer the expected feature values which for unreliable segments
will differ from the values obtained from preprocessing alone. This requires that the
architecture contains a generative model. Thus, as a by-product, we will be able to
use labelled as well as unlabelled data during model inference.
1.4.1 A simple DAG structure
This subsection proposes a DAG structure that allows spatio-temporal sensor fusion.
Measured in terms of model complexity (number of free parameters), we aim at a
simple solution. Assuming that we want to solve a classification task, we regard the
class labels as the state of interest. In order to exploit temporal correlations among
successive classifications we propose a model that has the latent indicator variables
di and the class labels ti connected to form a first order Markov chain. The idea
behind linking the class labels ti is that in cases where one of the state values of
di corresponds to a high uncertainty state (i.e. with similar probabilities P (tijdi)
for all ti) this gives us additional information about the class label. Furthermore,
we assume conditional independence between all latent variables depending on each
state variable. Figure 1.3 shows a DAG structure imposed by these assumptions.
Training data consists of labelled as well as unlabelled segments of a time series.
Unobserved states are represented by circles and the observed states by squares. We
have already decided about the latent variables ½i;j and indicator variables Ii;j : they
are stages in an AR-lattice filter and Xi;j are the corresponding segments of a time
series. It remains to decide about the generative model between the state variables di
and the latent variables ½i;j and Ii;j . In our case the ½i;j’s are continuous variables
and the generative model will be implemented as diagonal Gaussian distributions,
p(½i;j jdi) = N (½i;j j¹j;d;¸j;d); (1.19)
with mean ¹j;d and precision ¸j;d. The model indicator Ii;j is given a multinomial-1
distribution
p(Ii;j jdi) = Mn(Ii;j j¦j;d); (1.20)
Page 11
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 11
i+1,jρi,j Ii,j
X i,j
i
d i
t i+1t
i+1d
X i+1,j
Ii+1,jρ
Fig. 1.3. A directed acyclic graph for spatio-temporal sensor fusion: The DAG assumes a
first order Markov dependency among states di and conditional on these, independence of the
latent variables ½i;j . We used ti to denote the unknown state variables of interest (i.e. the class
labels of the segments under consideration), which are linked with a second Markov chain.
The idea behind this second chain is that it allows us to resolve ambiguity which might be
caused by a high uncertainty state. Both ½i;j and Ii;j are latent variables, representing feature
and model uncertainty from preprocessing. Finally Xi;j denote the corresponding segments of
a time series.
where¦j;d are the binary probabilities observing either one of the two preprocessing
models given state di. Model inference will be based on a Markov chain Monte Carlo
(MCMC) method. We thus need to consider the likelihood function, design a DAG
that shows the relations during inference, specify convenient priors and, as a final
step, formulate the MCMC updates.
1.4.2 A likelihood function for sequence models
As already mentioned, we are interested in a fully Bayesian treatment of the model.
This can be done as soon as we are able to formulate a normalized likelihood function
and priors. We are dealing with a sequence model, where the likelihood is usually
(see [13], pp. 150) formulated via paths that are possible sequences of latent states:
P (D;¦d;¦tjw) = P (d1)
Y
j
p(½1;j ; I1;j jd1)P (t1jd1) (1.21)
£
NY
i=2
0
@P (dijdi¡1)P (tijdi; ti¡1)
Y
j
p(½i;j ; Ii;j jti)
1
A :
Page 12
hidden
12 P. Sykacek, I. Rezek and S. Roberts
In (1.21) we used D to denote a realisation of all latent variables ½i;j and model
indicators Ii;j . A sequence of latent variables di is denoted as ¦d and a sequence of
labels is denoted by ¦t. Note that for the second sequence not all paths are possible
since we have to visit all given labels. We thus obtain the likelihood
P (D; T jw) =
X
¦d;¦t
P (D;¦d;¦tjw); (1.22)
where T denotes the observed class labels. If several independent sequences are
used to infer model coefficients, the overall likelihood is the product of several ex-
pressions like (1.22). In order to obtain a final expression of the likelihood of model
coefficients we have to plug Equations (1.19) and (1.20) into Equation (1.22). It is
evident that the resulting likelihood is highly nonlinear and parameter inference had
to be done by carrying out Metropolis-Hastings updates. The conventional way to
maximize such likelihood functions is to apply the expectation maximisation (EM)
algorithm [14] which was introduced for HMMs in [15]. The sampling algorithm
proposed in the next subsection uses similar ideas. We use both Gibbs updates and
Metropolis-Hastings updates to draw samples from the a-posteriori distribution over
model coefficients and latent variables.
1.4.3 An augmented DAG for MCMC sampling
The likelihood function associated with the probabilistic model in Figure 1.3 highly
non-linear in the model coefficients and we have to use MCMC methods to in-
fer them. As already indicated, we want to use Gibbs updates wherever possible.
Only such variables that do not allow Gibbs moves will be updated with Metropolis-
Hastings steps. Following the ideas of the EM procedure, we introduce latent indi-
cator variables, di, which indicate the kernel number of the Gaussian prior over the
latent variables ½i;j .
Figure 1.4 shows a DAG that results from the DAG in Figure 1.3 when aug-
mented with all coefficients of the probabilistic model and the hyper-parameters of
the corresponding priors. In order to keep the graph simple, Figure 1.4 displays only
those variables, we need to derive the MCMC updates for inference. In particular we
illustrate only one observation model. Other sensors are indicated by dots.
The state variable di is conditionally dependent on its predecessor and on the
transition probability T . Class labels, ti, depend on the state variable and on the pre-
ceding label ti¡1. The state conditional transition probabilities are summarized by
W . For both the transition probabilities, T = P (dijdi¡1) 8 di; di¡1, and prior al-
location probabilities, W = P (tijdi; ti¡1) 8 di; ti; ti¡1, we use a Dirichlet-prior.
Conditional on the latent state di, we have the observation model for the latent
variables ½i;j and the corresponding model indicator Ii;j . The observation model
for Ii;j is a multinomial-one distribution with observation probabilities ¦j =
P (Ii;j jdi) 8 Ii;j ; di. These probabilities are given a Dirichlet prior using ±¦ as prior
counts. Except that we do not infer the number opf kernesl, the model for ½i;j is
largely identical to the model used by [16] for their one dimensional mixture of
Page 13
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 13
i
i-1d
d i
ρi,j Ii,j
X i,j
λj
j α jβ
hg j j
κ j
ξ j
µ j
T
δ T

Πd
t
t
δWi-1
W
Fig. 1.4. A DAG for parameter inference. In order to keep it simple, the DAG shows only one
of the observation models for ½i;j , Ii;j one pair of latent variables di¡1, di, and one pair of
class labels, ti¡1, ti. We use three dots to indicate that there could be more sensors. The DAG
shows transition probabilities for the states, T = P (dijdi¡1) 8 di; di¡1 and for the class
labels W = P (tijdi; ti¡1) 8 di; ti; ti¡1. The multinomial observation model for the model
indicator is specified by ¦j = P (Ii;j jdi) 8 Ii;j ; di. Finally we have a hierarchical model
for specifying the observation model for the lattice filter coefficients specified by ¹j and ¸j .
As before, square nodes denote observed quantities and circles are latent variables. However
there is an exception since during model inference some of the ti’s are observed.
Gaussians analysis with varying number of kernels. The means, ¹j , are given a nor-
mal prior with mean »j and precision ·j . Each component has its own precision
(inverse variance) ¸j . In order to avoid problems with singular solutions the vari-
ances are coupled with hyper-parameters ® and ¯j . The latter, ¯j , has itself a Gamma
prior. This hierarchical prior specification allows for informative priors without in-
troducing large dependencies on the values of the hyper-parameters. The difference
between our observation model and the one used in [16] is that ½i;j are latent vari-
ables, with the time series Xi;j being conditionally dependent on ½i;j and Ii;j .
Page 14
hidden
14 P. Sykacek, I. Rezek and S. Roberts
1.4.4 Specifying priors
In order to be able to derive an MCMC scheme to sample from the a-posteriori distri-
butions of model coefficients and latent variables, we need to specify the functional
form as well as the parameters of all priors from the DAG in Figure 1.4. Gibbs sam-
pling requires full conditional distributions from which we can draw efficiently. The
full conditional distributions are the distributions of model coefficients when con-
ditioning on all other model coefficients, latent variables and data. Apart from the
EM-like idea to introduce latent variables, tractable distributions will be obtained by
using so-called conjugate priors, as discussed in [17].
In order to allow Gibbs updates for most of the parameters, we use the following
prior specification:
² Each component mean, ¹j;d, is given a Gaussian prior: ¹j;d » N1(»j ; ·¡1j ).
² The precision is given a Gamma prior: ¸j;d » ¡ (®; ¯j).
² The hyper-parameter, ¯j , gets a Gamma hyper-prior: ¯j » ¡ (g; h).
² The transition probabilities for the latent kernel indicators, T d get a Dirichlet
prior: T d » D(±1T ; :::; ±DT ), with D denoting the number of kernels.
² The transition probabilities for the class labels, W k;d, get a Dirichlet prior:
W k;d » D(±1W ; :::; ±KW ), with K denoting the number of class labels.
² The observation probabilities of the model indicators, ¦d get a Dirichlet prior:
¦d » D(±1¦ ; :::; ±D¦), with D denoting the number of kernels.
The quantitative settings are similar to those used in [16]: Values for ® are between
1 and 2, g is usually between 0:2 and 1 and h is typically between 1=R2max and
10=R2max, with Rmax denoting the largest input range. The mean, ¹j , gets a Gaus-
sian prior centred at the midpoint, »j , with inverse variance ·j = 1=R2j , where Rj
is the range of the j-th input. The multinomial priors of the prior allocation counts
and the prior transition counts are set up with equal probabilities for all counters. We
set all prior counts i.e. for ±T , ±W and ±¦ to 1, which gives the most uninformative
proper prior.
1.4.5 MCMC updates of coefficients and latent variables
The prior specification proposed in the last subsection enables us to use mainly
Gibbs updates. The latent variables ½i;j however need to be sampled via more gen-
eral Metropolis-Hastings updates. The main difficulty is that we regard “input” vari-
ables as being latent. In other words: conventional approaches condition on the ½i;j’s,
whereas our approach regards them as random variables which have to be updated
as well.
We will first summarise the Gibbs updates for the model coefficients. The expres-
sions condition on hyper parameters, other model coefficients, hidden states, class la-
bels and the latent representation of preprocessing (i.e. ½i;j and Ii;j). During model
inference we need to update all unobserved variables of the DAG, whereas for pre-
dictions we update only the variables shown in the DAG in Figure 1.3. The updates
for the model coefficients are done using full conditional distributions, which have
Page 15
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 15
the same functional forms as the corresponding priors. These full conditionals fol-
low previous publications [11], with some modifications required by the additional
Markov dependency between successive class labels.
Update of the transition probabilities, T
The full conditional of the transition probabilities T di [di+1] = P (di+1jdi) is a
Dirichlet distribution. The expression depends on the prior counts, ±T , and on all
hidden states, di.
T d » D(±T + ndd;1; :::; ±T + ndd;D); (1.23)
with ndd;1; :::; ndd;D denoting the number of transitions from state di = d to di+1 =
f1; ::; Dg. The prior probability of the hidden states P T is the normalised eigenvec-
tor of T that corresponds to the eigenvalue 1.
Update of the transition probabilities for class labels, W
The full conditional of the transition probabilities, W di;ti¡1 [ti] = P (tijdi; ti¡1) is
a Dirichlet distribution. These transition probabilities are conditional on both, the
previous class label and the current state. We thus obtain the posterior counts, n¿d;t,
by counting transitions from class label ¿ to class label t, given the current latent
state is d. The transition probability is a Dirichlet distribution.
W ¿d » D(±W + n¿d;1; :::; ±T + n¿d;K): (1.24)
Conditional on d, the prior probability of the class labels PW;d is the normalised
eigenvector of W d that corresponds to the eigenvalue 1.
Update of the observation probability of model orders, ¦j
The full conditional of the observation probabilities of model order ¦j [di; Ii;j ] =
P (Ii;j jdi; j) is a Dirichlet distribution. The expression depends on the prior counts,
±¦ , on the model indicators Ii;j and on the state di. Each sensor j has its own set of
probabilities P j .
¦j;d » D(±P + nIjd;0; :::; ±P + nIjd;Imax); (1.25)
where nIjd;I denotes the number of cases, in which Ii;j ´ I and di ´ d.
Updating the Gaussian observation model
We use a separate Gaussian observation model for the latent features ½i;j of each
sensor j. Each observation model needs three updates.
The full conditional of the kernel mean, ¹j , is a Normal distribution. The ex-
pression depends on the hyper parameters ·j and »j , on the hidden states di, on the
model indicators Ii;j , on the covariance matrix ¸j and on the latent coefficients ½i;j .
Page 16
hidden
16 P. Sykacek, I. Rezek and S. Roberts
¹j;d[k] » N (¹^j;d[k]; ¾¹j;d[k]) (1.26)
with
¹^j;d[k] = (n¹j;d[k]¸j;d[k; k] + ¯j [k])¡1(n¹j;d[k]¸j;d[k; k]¹½d;j [k] + ·j [k]»j [k])
¾¹j;d[k] = (n¹j;d[k]¸j;d[k; k] + ¯j [k])¡1;
where we use index k to denote the k-th dimension of the latent vector ½i;j . The
dependency of Equation (1.26) on Ii;j is implicit in
n¹j;d[k] =
X
ijIi;j´1
±(di ´ d)
and in
¹½d;j [k] =
1
n¹j;d[k]
X
ijIi;j´1
±(di ´ d)½i;j [k]:
The full conditional of the kernel covariance, ¸j;d[k], is a Gamma distribution. The
expression depends on the hyper parameter ®j and ¯j , on the hidden states, di, on
the corresponding kernel mean ¹j;d[k]and on the model indicators Ii;j .
¸j;d[k] » ¡ (®^; ^¯) (1.27)
with
®^ = ®j +
n¹j;d[k]
2
^¯ = ¯j [k] +
1
2
X
ijIi;j´1
±(di ´ d)(½i;j [k]¡ ¹j;d[k])2
The full conditional of the hyper parameter¯j depends on the hyper-hyper-parameters
gj and hj , on ®j and on ¸j;d.
w¯j [k] » ¡ (gj +D®j ;hj [k] +
X
d
¸j;d[k]) (1.28)
1.4.6 Gibbs updates for hidden states and class labels
Updating di
The full conditionals for di are multinomial-one distributions. The first state has
no preceding state d0. We thus use the unconditional prior probability of the state,
PT (d1) instead of Td0 [d1].
di » Mn(1; fP (dij:::) 8 di = 1::Dg) (1.29)
with
P (d1j:::) =
PT [d1]Td1 [d2]PW;d1 [t1]
Q
j p(½1;j ; I1;j jd1)P
d1(PT [d1]Td1 [d2]PW;d1 [t1]
Q
j p(½1;j ; I1;j jd1)
P (di 6=1j:::) =
Tdi¡1 [di]Tdi [di+1]Wdi;ti¡1 [ti]
Q
j p(½i;j ; Ii;j jdi)P
di(Tdi¡1 [di]Tdi [di+1]Wdi;ti¡1 [ti]
Q
j p(½i;j ; Ii;j jdi)
Page 17
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 17
For the last state of the Markov chain, the successor di+1 does not exist and the
expression of the probability, P (di 6=1j:::), does not include the term Tdi [di+1].
Updating ti
Unknown class labels, ti, are updated by multinomial one distributions.
Ti » Mn(1; fP (tij:::) 8 ti = 1::Kg) (1.30)
with
P (t1j:::) = PW;d1 [t1]Wd2;t1 [t2]P
t1 PW;d1 [t1]Wd2;t1 [t2]
P (ti 6= 1j:::) =
Wdi;ti¡1 [ti]Wdi+1;ti [ti+1]P
ti Wdi;ti¡1 [ti]Wdi+1;ti [ti+1]
Since class label ti+1 does not exist for the last segment, we have to remove the term
Wdi+1;ti [ti+1] when expressing the corresponding probability P (tij:::).
1.4.7 Approximate updates of the latent feature space
We finally have to formulate updates for the latent feature space, i.e. for the variables
½i;j and Ii;j . These updates involve the posteriors formulated in Equations (1.10)
and (1.12) and are drawn from the “joint conditional” distribution p(Ii;j ;½i;j j:::) /
p(Ii;j ;½i;j jXi;j)p(dijIi;j ;½i;j). We draw from p(Ii;j ;½i;j j:::) by first proposing
from (I 0i;j ;½0i;j) » p(Ii;j ;½i;j jXi;j) and then accepting (I 0i;j ;½0i;j) by a Metropo-
lis Hastings acceptance probability. We draw model indicator I 0i;j from p(Ii;j jXi;j)
which is a multinomial-one distribution.
I 0i;j »Mn(1; fP (Ii;j jXi;j)); (1.31)
with P (Ii;j jXi;j) defined in Equation (1.10). Using the indicator variable we then
propose ½0i;j .
½0i;j »
(
8I 0i;j ´ 1 : ½0i;j » p(½i;j jXi;j)
8I 0i;j ´ 0 : ½0i;j = []
; (1.32)
where p(½i;j jXi;j) is a product of the distributions in Equation (1.12) and the second
case denotes the white noise case, in which the latent parameter ½0i;j has dimension
zero. In order to get a sample from the full conditional distribution, we have to cal-
culate the acceptance probability
Pa = min
µ
1; P (dij½
0
i;j ; I 0i;j)
P (dij½i;j ; Ii;j)

(1.33)
where
P (dij½0i;j ; I 0i;j) =
8
><
>:
8I 0i;j ´ 1 :
p(½0i;j j¹j;di ;¸j;di )¦di;j [I0i;j ]P
di p(½0i;j j¹j;di ;¸j;di )¦di;j [I0i;j ]
8I 0i;j ´ 0 :
¦di;j [I0i;j ]P
di ¦di;j [I
0
i;j ]
Page 18
hidden
18 P. Sykacek, I. Rezek and S. Roberts
and accept the new values of the latent features (I 0i;j ;½0i;j) according to this proba-
bility. We would like to point out that this scheme is an approximation to the exact
updates of the latent space since the proposal distributions in Equations (1.10) and
(1.12) do not consider the correct prior which would be the mixture distribution
P (di)p(½i;j jdi)p(Ii;j jdi). Instead we use the priors specified in section 1.2. How-
ever, we argue that the difference is not large if the number of samples is sufficient
since in this case the likelihood by far outweighs the prior.
1.4.8 Algorithms
Model inference
Algorithm 1 Pseudo-code for sweeps during model inference.
0000000000000000000000000000000000000
8 d initialise(W d;T d)
8 j initialise(¯j)
8 j; d initialise(¹j;d;¸j;d;¦j;d)
8 i initialise(di; ti) % only missing ti are initialised
8 i; j initialise(Ii;j ;½i;j)
sweepcounter=0
REPEAT
8 d update(T d) % according to Equation (1.23)
8 d update(W d) % according to Equation (1.24)
8 j; d update(¦j;d) % according to Equation (1.25)
8 j; d update(¹j;d) % according to Equation (1.26)
8 j; d update(¸j;d) % according to Equation (1.27)
8 j update(¯j) % according to Equation (1.28)
8 i; j (½0i;j ; I 0i;j) » p(Ii;j ;½i;j jXi;j)% according to Eqns. (1.10) and (1.12)
accept(½0i;j ; I 0i;j) with Pa % according to Equation (1.33)
8 i update(di) % according to Equation (1.29)
8 i if ti missing
update(ti) % according to Equation (1.30)
inc(sweepcounter)
sample[sweepcounter]=fT d;W d;¦j;d;¹j;d;¸j;d 8j; dg
UNTIL (sweepcounter > maxcount)
For model inference we start the MCMC scheme by drawing initial model param-
eters from their priors. We initialize all (Ii;j ;½i;j) » p(Ii;j ;½i;j jXi;j) with samples
drawn from the posteriors in Equations (1.10) and (1.12) and all unobserved states
di are drawn from the prior probabilities P (di). The class labels of unlabelled seg-
ments are drawn from their prior P (tijdi) as well. After this initialisation we update
according to the full conditionals and in the case of feature updates according to the
single component Metropolis Hastings step. Pseudo code of these updates is shown
in Algorithm 1.
Page 19
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 19
Predictions
Algorithm 2 Pseudo-code for sweeps during predictions.
0000000000000000000000000000000000000
8 i initialise(di; ti)
8 i; j initialise(Ii;j ;'i;j)
sweepcounter=0
expcounter=0
REPEAT
inc(sweepcounter)
fT d;W d;¦j;d;¹j;d;¸j;d 8j; dg = sample[sweepcounter]
8 i; j (½0i;j ; I 0i;j) » p(Ii;j ;½i;j jXi;j)% according to Eqns. (1.10) and (1.12)
accept(½0i;j ; I 0i;j) with Pa % according to Equation (1.33)
8 i update(di) % according to Equation (1.29)
8 i update(ti) % according to Equation (1.30)
if sweepcounter > burnincounter
P (ti) = P (ti) + 1
expcounter = expcounter + 1
UNTIL (sweepcounter > maxcount)
8 ti P (ti) = P (ti)=expcounter
In order to get consistent estimates of the beliefs of the states, predictions have
to be marginalized over ½i;j and Ii;j . The integrals need to be solved numerically.
We use all samples drawn from the posterior in order to allow the ½i;j’s of the test
data to converge. All sample expectations are then taken after allowing for a burn
in period. Apart from beliefs about states, we can also obtain expectations from all
latent variables - most interestingly from the Ii;j’s and ½i;j’s.
Predictions are based on an approximation of the a-posteriori distribution of class
labels, ti, latent allocations, di and latent variables, ½i;j . We initialize the latent vari-
ables di and the class labels ti by drawing according to the respective prior prob-
abilities. The initial (I 0i;j ;½0i;j) » p(Ii;j ;½i;j jXi;j) are drawn from the posteriors
in Equations (1.10) and (1.12). The coefficients are then updated using full condi-
tional distributions for ti and di and the single component Metropolis Hastings step
for (Ii;j ; ½i;j). During each round of Algorithm 2, we use the next sample from
the Markov chain obtained during parameter inference. We are interested in obtain-
ing the probabilities of class labels ti and expected values of the latent variables
(Ii;j ; ½i;j). We estimate these quantities by averaging after having allowed for a burn
in period.
Assessing convergence
Although in theory MCMC algorithms can approximate arbitrary distributions, the
difficulty is that there is a random error inherent to the approach. This means that we
Page 20
hidden
20 P. Sykacek, I. Rezek and S. Roberts
need to estimate how many samples from the posterior we need in order to be able
to approximate the desired value with sufficient accuracy. Different approaches to
obtain such estimates are discussed at length in the MCMC literature (e.g. [18] and
[19]). Although all suggested approaches can diagnose non convergence, there is no
way that allows to assess convergence with certainty. We thus need to treat quoted
numbers with caution. Despite this difficulty, it is important to have at least a rough
idea about how many samples to draw. The suggested methods follow two different
strategies.
² Along the lines of [20], we can compare the samples obtained from multiple runs.
² As is suggested in [21], we can use one run and apply Markov chain theory to
assess how many samples we need to discard at the beginning to get estimates
that are independent of the starting value and how many samples we need to
converge to a sufficiently accurate result.
Although there are cases where the second case might fail to diagnose slow conver-
gence, [19] point out that the multiple chain approach is less efficient since we have
to wait more than once until the Markov chain visits a high probability region. The
model proposed in this work has definitely a very complicated structure, such that
initial convergence might be slow. We thus apply the method suggested in [21] to the
likelihood of the class labels we calculate from the Markov chain. We should point
out that we cannot use the model coefficients directly since the mixture model is not
identified and label switching might give a wrong sense of the actual mixing. Calcu-
lations suggest that we should draw around 15; 000 samples which is confirmed by
observing virtually no difference in probabilities obtained from repeated runs.
1.5 Experiments
This section evaluates the proposed method on two biomedical classification prob-
lems. The first experiment classifies in segments of sleep EEG, whether sleep spin-
dles are present or not. We use this data because it reflects a problem of very un-
balanced class labels. For the second experiment we apply the proposed method to
classification of cognitive states of segments of EEG recordings. In this case the
cognitive experiments have been designed to obtain data with balanced class labels.
1.5.1 Data
Classification of sleep spindles
Sleep spindles are an important phenomenon in sleep EEG. This experiment uses
data recorded from the standard 10-20 electrodes F4, C4 and P4 [22, 23] that were
recorded against averaged ear potential. The sampling frequency used here was 102:4
Hz. This data is segmented into segments of 0:625 seconds duration such that we get
16 segments for a 10 second recording. A segment is labelled as containing a spin-
dle, if more than half of the segment shows spindle activity. This setup results in a
Page 25
hidden
1 Bayes Consistent Classification of EEG Data by Approximate Marginalisation 25
References
1. G. E. P. Box and G. M. Jenkins. Time Series Analysis, forecasting and control. Holden-
Day, Oakland, California, 1976.
2. G. L. Bretthorst. Bayesian analysis i: Parameter estimation using quadratur nmr models.
Journal of Magnetic Resonance, 88:533–551, 1990.
3. J. J. K. ´O Ruanaidh and W. J. Fitzgerald. Numerical Bayesian Methods Applied to Signal
Processing. Springer-Verlag, New York, 1995.
4. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.
5. D. J. C. MacKay. The evidence framework applied to classification networks. Neural
Computation, 4:720–736, 1992.
6. R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996.
7. L. Ljung. System Identification, Theory for the User. Prentice-Hall, Englewood Cliffs,
New Jersey, 1999.
8. W. A. Wright. Bayesian approach to Neural-Network modelling with input uncertainty.
IEEE Trans. Neural Networks, 10:1261–1270, 1999.
9. P. Dellaportas and S. A. Stephens. Bayesian analysis of errors-in-variables regression
models. Biometrics, 51:1085–1095, 1995.
10. H. Jeffreys. Theory of Probability. Clarendon Press, Oxford, third edition, 1961.
11. P. Sykacek and S. Roberts. Bayesian time series classification. In T.G. Dietterich,
S. Becker, and Z. Gharamani, editors, Advances in Neural Processing Systems 14, pages
937–944. MIT Press, 2002.
12. M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas,
Graphs and Mathematical Tables. Dover, New York, 1965.
13. S. Brunak and P. Baldi. Bioinformatics. MIT Press, Cambridge, Massachusetts, 1998.
14. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data
via the EM algorithm (with discussion). Journal of the Royal Statistical Society series B,
39:1–38, 1977.
15. L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical
Statistics, 41:164–171, 1970.
16. S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number
of components. Journal Royal Stat. Soc. B, 59:731–792, 1997.
17. J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley, Chichester, 1994.
18. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter (ed.). Markov Chain Monte Carlo in
Practice. Chapman & Hall, London, 1996.
19. C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, New York, 1999.
20. A. Gelman and D. B. Rubin. Inference from iterative simulations using multiple se-
quences (with discussion). Statist. Sci., 7:457–511, 1992.
21. A. E. Raftery and S. M. Lewis. Implementing MCMC. In W.R. Gilks, S. Richardson,
and D.J. Spiegelhalter, editors, Markov Chain Monte Carlo in practice, pages 115–130.
Chapman & Hall, London, Weinheim, New York, 1996.
22. HH. Jasper. Report of the committee on methods of clinical examination in electroen-
cephalography. Clinical Neurophysiology, 10:370–1, 1958.
23. HH. Jasper. Appendix to report of the committee on methods of clinical examination
in EEG: the ten-twenty electrode system of the International Federation of Electroen-
cephalography. Clinical Neurophysiology, 10:371–375, 1958.

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

4 Readers on Mendeley
by Discipline
 
 
by Academic Status
 
50% Ph.D. Student
 
25% Post Doc
 
25% Researcher (at an Academic Institution)
by Country
 
50% United States
 
25% Australia
 
25% Austria