A new approach to analyzing gene expression time series data
Proceedings of the sixth annual international conference on Computational biology RECOMB 02 (2002)
- ISBN: 1581134983
- DOI: 10.1145/565196.565202
A New Approach to Analyzing Gene Expression Time Series Data
Ziv Bar-Joseph Georg Gerber David K. Gifford Tommi S. Jaakkola
MIT Lab for Computer Science and MIT AI Lab
200 Technology Square, Cambridge, MA 02139
{zivbj,georg,gifford}@mit.edu, tommi@ai.mit.edu
Itamar Simon
Whitehead Institute for Biomedical Research
9 Cambridge Center, Cambridge, MA 02142
simon@wi.mit.edu
Abstract
We present algorithms for time-series gene expression analy-
sis that permit the principled estimation of unobserved time-
points, clustering, and dataset alignment. Each expression
profile is modeled as a cubic spline (piecewise polynomial)
that is estimated from the observed data and every time point
influences the overall smooth expression curve. We constrain
the spline coefficients of genes in the same class to have sim-
ilar expression patterns, while also allowing for gene specific
parameters. We show that unobserved time-points can be re-
constructed using our method with 10-15% less error when
compared to previous best methods. Our clustering algo-
rithm operates directly on the continuous representations of
gene expression profiles, and we demonstrate that this is par-
ticularly effective when applied to non-uniformly sampled
data. Our continuous alignment algorithm also avoids diffi-
culties encountered by discrete approaches. In particular, our
method allows for control of the number of degrees of free-
dom of the warp through the specification of parameterized
functions, which helps to avoid overfitting. We demonstrate
that our algorithm produces stable low-error alignments on
real expression data and further show a specific application
to yeast knockout data that produces biologically meaningful
results.
1 Introduction
Principled methods for estimating unobserved time-points,
clustering, and aligning microarray gene expression time-
series are needed to make such data useful for detailed anal-
ysis. Datasets measuring temporal behavior of thousands of
genes offer rich opportunities for computational biologists.
For example, Dynamic Bayesian Networks may be used
to build models and try to understand how genetic responses
unfold. However, such modeling frameworks need a sufficient
quantity of data in the appropriate format. Current
gene expression time-series data often do not meet these requirements,
since they may be missing data points, may be sampled
non-uniformly, and may measure biological processes that exhibit
temporal variation.
In many applications, researchers may face the problem of
reconstructing unobserved gene expression values. Values
may not have been observed for two reasons. First, errors
may occur in the experimental process that lead to corrup-
tion or absence of some expression measurements. Second,
we may want to estimate expression values at time points
different from those originally sampled. In either case, the
nature of microarray data makes straightforward interpola-
tion difficult. Data are often very noisy and there are few
replicates. Thus, simple techniques such as interpolation of
individual genes can lead to poor estimates. Additionally, in
many cases there are a large number of missing time-points
in a series for any given gene, making gene specific interpo-
lation infeasible. In the case of clustering, the treatment of
time-series can be problematic, as a time-series represents a
set of dependent experiments. A particular problem arises
when series are not sampled uniformly, such as in [14, 3, 7].
Variability in the timing of biological processes further
complicates gene expression time-series analysis. The rate
at which similar underlying processes such as the cell-cycle
unfold can be expected to differ across organisms, genetic
variants, and environmental conditions. For instance, Spellman
et al. [14] analyze time-series data for the yeast cell-cycle
in which different methods were used to synchronize
the cells. It is clear that the cycle lengths across the different
experiments vary considerably, and that the series begin and
end at different phases of the cell-cycle. Thus, one needs a
method to align such series to make them comparable.
In this paper we use statistical spline estimation to rep-
resent time-series gene expression profiles as continuous
curves. Our method takes into account the actual duration
each time point represents, unlike most previous approaches
that treat expression time-series like static data consisting
of vectors of discrete samples [7, 11, 8]. Our algorithm
generates a set of continuous curves that can be used di-
rectly for estimating unobserved data. However, although
our method uses spline curves (piecewise polynomials) to
represent gene expression profiles, it is not reasonable to fit
each gene with an individual spline due to the issues with
microarray datasets discussed above. Instead, we constrain
the spline coefficients of genes in the same class to covary
similarly, while also allowing for gene specific parameters.
A class is a set of genes with similar expression profiles that
may be constructed using prior biological knowledge or clus-
tering methods. We present a clustering algorithm that infers
classes automatically by operating directly on the continu-
ous representations of expression profiles. This is particu-
larly effective when applied to non-uniformly sampled data.
However, note that our method does require data that has
been sampled at a sufficiently high rate. We demonstrate
in Section 5 that our method performs well on several such
datasets, but for other datasets that have been sampled at
rates too low to capture changes in the underlying biological
processes, our method will not be effective. A future direc-
tion would be to use our method to determine the quality of
the sampling rate.
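To make the idea of a continuous curve representation concrete, the following is a minimal sketch of a natural cubic spline that interpolates one non-uniformly sampled profile and can be evaluated at an unobserved time. It is not the estimation procedure of this paper: it fits a single gene in isolation, which is exactly what the text above argues against for noisy data, and it omits the class-level constraint on the spline coefficients. The function name, the sample times, and the expression values are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(times, values):
    """Return a function that evaluates the natural cubic spline through
    the points (times[i], values[i]). Times must be strictly increasing."""
    n = len(times) - 1  # number of polynomial pieces
    h = [times[i + 1] - times[i] for i in range(n)]

    # Right-hand side of the tridiagonal system for the second-order terms.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3.0 * (values[i + 1] - values[i]) / h[i]
                    - 3.0 * (values[i] - values[i - 1]) / h[i - 1])

    # Forward elimination (natural boundary: zero curvature at both ends).
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (times[i + 1] - times[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]

    # Back substitution for the per-piece coefficients b, c, d.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = ((values[j + 1] - values[j]) / h[j]
                - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0)
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(t):
        # Pick the piece containing t (clamped to the first/last piece).
        j = min(max(bisect_right(times, t) - 1, 0), n - 1)
        dt = t - times[j]
        return values[j] + b[j] * dt + c[j] * dt ** 2 + d[j] * dt ** 3

    return evaluate


# Hypothetical profile sampled at non-uniform times (minutes).
times = [0.0, 7.0, 14.0, 21.0, 42.0, 70.0]
values = [0.0, 0.8, 1.1, 0.6, -0.4, 0.1]
curve = natural_cubic_spline(times, values)
estimate_at_30 = curve(30.0)  # value at a time that was never sampled
```

Because the curve is defined for every time in the sampled range, the spacing of the original samples no longer restricts where estimates can be made.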
Our alignment algorithm uses the same spline representa-
tion of expression profiles to continuously time-warp series.
First, a parameterized function is chosen that maps the time-
scale of one series into another. Because we use parame-
terized functions, we are explicitly specifying the number
of allowed degrees of freedom, which is helpful in avoiding
overfitting. Our algorithm seeks to maximize the similarity
between the two sets of expression profiles by adjusting the
parameters of the warping function.
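The warping idea described above can be sketched as follows, under strong simplifying assumptions: the warp is linear (two parameters, a stretch and a shift), the two curves are synthetic sinusoids rather than fitted splines, and the parameters are found by an exhaustive grid search rather than by whatever optimizer an actual implementation would use. All names and numbers are hypothetical.

```python
import math


def warp(t, stretch, shift):
    """Linear warp with two degrees of freedom: maps a time on the scale
    of series B to the corresponding time on the scale of series A."""
    return stretch * t + shift


def alignment_error(curve_a, curve_b, stretch, shift, grid):
    """Mean squared difference between curve_b(t) and curve_a(warp(t))."""
    total = 0.0
    for t in grid:
        diff = curve_b(t) - curve_a(warp(t, stretch, shift))
        total += diff * diff
    return total / len(grid)


def fit_linear_warp(curve_a, curve_b, grid, stretches, shifts):
    """Grid search for the warp parameters with the smallest error.
    Returns (error, stretch, shift)."""
    best = None
    for s in stretches:
        for d in shifts:
            err = alignment_error(curve_a, curve_b, s, d, grid)
            if best is None or err < best[0]:
                best = (err, s, d)
    return best


def curve_a(t):
    """Synthetic periodic profile with a 60-minute cycle."""
    return math.sin(2.0 * math.pi * t / 60.0)


def curve_b(t):
    """The same process observed on a different time scale."""
    return curve_a(1.5 * t + 10.0)


grid = [float(t) for t in range(0, 81, 5)]
stretches = [i / 10.0 for i in range(5, 21)]   # 0.5, 0.6, ..., 2.0
shifts = [float(d) for d in range(0, 31)]      # 0, 1, ..., 30
error, stretch, shift = fit_linear_warp(curve_a, curve_b, grid, stretches, shifts)
```

However many time-points each series contains, the search here only ever adjusts two numbers, which is the sense in which a parameterized warp fixes the degrees of freedom in advance.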
The remainder of this paper is organized as follows. In
Section 2 we discuss our algorithm for estimating unob-
served data and in Section 3 we extend this algorithm to
perform clustering. In Section 4 we present our alignment
algorithm. Section 5 presents applications of our method to
expression data and Section 6 concludes the paper and sug-
gests directions for future work.
1.1 Related Work
Recently, several papers have focused on modeling and an-
alyzing the temporal aspects of gene expression data. In
Holter et al. [9], a time translational matrix is used to model
the temporal relationships between different modes of the
Singular Value Decomposition (SVD). Unlike our work, this
method focuses on the SVD modes and not on specific genes.
In addition, only relationships between time points that are
sampled at the lowest common frequencies can be studied.
Thus, not all available expression data can be used. In Zhao
et al. [17], a statistical model is fit to all genes in order to find
those that are cell-cycle regulated. This method uses a custom-tailored
model, relying on the periodicity of the specific
dataset analyzed, and is thus less general than our approach.
Several papers have used simple interpolation techniques
to estimate missing values for gene expression data. Aach
et al. [1] use linear interpolation to estimate gene expression
levels for unobserved time-points. D’haeseleer [6] uses
spline interpolation on individual genes to interpolate missing
time-points. As we show in Section 5.2, neither technique
can approximate the expression curve of a gene well, especially
if there are many missing values. In Troyanskaya
et al. [15], several techniques for missing value estimation
were explored. However, none of the suggested techniques
takes into account the actual times the points correspond to,
and thus time-series data are treated in the same way as static
data. As a consequence, their techniques cannot estimate
values for time-points between those measured in the original
experiments.
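The difference between using and ignoring the actual sampling times can be illustrated with a small sketch of per-gene linear interpolation. This is a generic textbook interpolation written for illustration, not code from any of the cited papers, and the sample times and values are hypothetical.

```python
from bisect import bisect_right


def interpolate_linear(times, values, t):
    """Linearly interpolate a single profile at time t, using the actual
    (possibly non-uniform) sampling times."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the sampled time range")
    if t == times[-1]:
        return values[-1]
    i = bisect_right(times, t) - 1          # index of the sample at or before t
    t0, t1 = times[i], times[i + 1]
    fraction = (t - t0) / (t1 - t0)         # position of t inside the interval
    return values[i] + fraction * (values[i + 1] - values[i])


# Hypothetical non-uniformly sampled profile (times in minutes).
times = [0.0, 10.0, 20.0, 40.0, 80.0]
values = [0.0, 1.0, 2.0, 4.0, 8.0]

# Time-aware estimate at t = 50, a quarter of the way from 40 to 80.
time_aware = interpolate_linear(times, values, 50.0)

# A time-blind estimate that only sees neighbouring array positions
# returns the same number wherever t falls between the two samples.
time_blind = (values[3] + values[4]) / 2.0
```

The two estimates differ because the second treats the samples as positions in a vector rather than as measurements taken 40 minutes apart.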
There is a considerable statistical literature that deals
with the problem of analyzing non-uniformly sampled data.
These models, known as mixed-effects models [2], use spline
estimation methods to construct a common class profile for
their input data. Recently, James and Hastie [10] presented
a reduced rank mixed effects model that was used for classi-
fying medical time-series data. In this paper we extend these
methods to gene expression data. Unlike the above papers,
we focus on the gene specific aspects rather than the common
class profile. In addition, we present a method that is able to
deal with cases in which class membership is not given. An-
other difference between this work and [10] is that we do not
use a reduced rank approach, since gene expression datasets
contain information about thousands of genes.
Many clustering algorithms have been suggested for gene
expression analysis (see [12]). However, as far as we are
aware, all these algorithms treat their input as a vector of
data points, and do not take into account the actual times
at which these points were sampled. In contrast, our algo-
rithm weights time points differently according to the sam-
pling rate.
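One simple way to picture weighting by sampling rate is a squared distance in which each time-point counts in proportion to the duration it represents. The sketch below uses trapezoid-rule weights on the raw samples; this weighting scheme is an assumption made for illustration and is not the clustering algorithm of this paper, which operates on the continuous spline representations. The function names and data are hypothetical.

```python
def trapezoid_weights(times):
    """Weight of each sample: half the gap to its left neighbour plus
    half the gap to its right neighbour (trapezoid rule)."""
    n = len(times)
    weights = []
    for i in range(n):
        left = times[i] - times[i - 1] if i > 0 else 0.0
        right = times[i + 1] - times[i] if i < n - 1 else 0.0
        weights.append((left + right) / 2.0)
    return weights


def weighted_sq_distance(times, profile_x, profile_y):
    """Approximates the integral over time of the squared difference
    between two profiles sampled at the same time-points."""
    weights = trapezoid_weights(times)
    return sum(w * (x - y) ** 2
               for w, x, y in zip(weights, profile_x, profile_y))


def plain_sq_distance(profile_x, profile_y):
    """Vector-style squared distance that ignores the sampling times."""
    return sum((x - y) ** 2 for x, y in zip(profile_x, profile_y))


# Hypothetical series: dense sampling early, sparse sampling late.
times = [0.0, 5.0, 10.0, 15.0, 20.0, 60.0, 100.0]
gene_x = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
gene_y = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

by_time = weighted_sq_distance(times, gene_x, gene_y)
by_index = plain_sq_distance(gene_x, gene_y)
```

In the vector-style distance the four densely sampled early points dominate, whereas the time-weighted distance gives them only the share of the experiment's duration that they actually cover.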
Aach et al. [1] presented a method for aligning gene expression
time-series that is based on Dynamic Time Warping,
a discrete method that uses dynamic programming and is
conceptually similar to sequence alignment algorithms. Unlike
in our method, the allowed degrees of freedom of the
warp operation in Aach et al. depend on the number of data
points in the time-series. Their algorithm also allows mappings
of multiple time-points to a single point, thus stopping
time in one of the datasets. In contrast, our algorithm
avoids temporal discontinuities by using a continuous warp-
ing representation. There is also a substantial body of work
in the speech recognition and computer vision community
that deals with data alignment. For instance, non-stationary
Hidden Markov models with warping parameters have been
used for alignment of speech data [5], and mutual informa-
tion based methods have been used for registering medical
images [16]. However, these methods generally assume high
resolution data, which is not the case with available gene ex-
pression datasets.
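For readers unfamiliar with Dynamic Time Warping, the following is a sketch of the textbook dynamic-programming formulation on two short one-dimensional series. It is not necessarily the exact variant used by Aach et al.; the absolute-difference cost and the example data are assumptions chosen to show how several time-points of one series can be mapped to a single time-point of the other.

```python
def dtw(series_a, series_b):
    """Textbook Dynamic Time Warping with absolute-difference cost.
    Returns (total cost, alignment path as (index_a, index_b) pairs)."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Fill the cumulative cost table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j - 1],  # advance both
                                    cost[i - 1][j],      # advance a only
                                    cost[i][j - 1])      # advance b only

    # Trace the cheapest path back from the end to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Series b lingers at the value 1 for three samples.
series_a = [0.0, 1.0, 2.0, 3.0]
series_b = [0.0, 1.0, 1.0, 1.0, 2.0, 3.0]
total_cost, path = dtw(series_a, series_b)

# Time-points of series_b that are all matched to index 1 of series_a.
matched_to_a1 = [j for i, j in path if i == 1]
```

In this example three consecutive samples of the second series are matched to one sample of the first, so time effectively stands still in the first series over that stretch.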



