Principal Component Analysis and Long Time Protein Dynamics
Journal of Physical Chemistry (1996)
- ISSN: 00223654
- DOI: 10.1021/jp9536920
Available from pubs.acs.org
or
Abstract
It has been suggested that principal component analysis can identify slow modes in proteins and, thereby, facilitate the study of long time dynamics. However, sampling errors due to finite simulation times preclude the identification of slow modes that can be used for this purpose. This is demonstrated numerically with the aid of simulations of the protein G-actin and analytically with the aid of a model which is exactly recoverable by principal component analysis. Although principal component analysis usually demonstrates that a set of a small number of modes captures the majority of the fluctuations, the set depends on the particular sampling time window and its width.
Available from pubs.acs.org
Page 1
Principal Component Analysis and Long Time Protein Dynamics
Principal Component Analysis and Long Time Protein Dynamics
M. A. Balsera, W. Wriggers, Y. Oono, and K. Schulten*
Department of Physics and Beckman Institute, UniVersity of Illinois at Urbana-Champaign,
405 North Mathews AVenue, Urbana, Illinois 61801
ReceiVed: December 12, 1995; In Final Form: December 21, 1995
X
It has been suggested that principal component analysis can identify slow modes in proteins and, thereby,
facilitate the study of long time dynamics. However, sampling errors due to finite simulation times preclude
the identification of slow modes that can be used for this purpose. This is demonstrated numerically with the
aid of simulations of the protein G-actin and analytically with the aid of a model which is exactly recoverable
by principal component analysis. Although principal component analysis usually demonstrates that a set of
a small number of modes captures the majority of the fluctuations, the set depends on the particular sampling
time window and its width.
I. Introduction
Molecular dynamics has become an important tool to study
proteins.
1,2
One of the major obstacles limiting its usefulness
is the shortness of achievable simulation times. These times,
on the order of a few nanoseconds, are much shorter than the
time scales of many important protein processes, e.g., folding.
2
Many attempts have been made (e.g., refs 3, 4) to extend the
time scale of molecular dynamics simulations, but as yet, no
satisfactory solution has been found.
One possible approach to reduce the number of degrees of
freedom is principal component analysis (PCA), which is also
known as the Karhunen-Loeve expansion
5
in time series
analysis. A detailed implementation of the method in the current
context can be found in ref 6, and its use has been recently
reviewed in ref 7. This method was introduced to the protein
research community under the name “quasi-harmonic analysis”
by Karplus and co-workers
8-10
to facilitate the computation of
configurational entropies. The method has also been employed
with the hope of describing molecular dynamics trajectories in
terms of a small number of variables, sometimes called essential
degrees of freedom.
11-15
It appears natural to use the PCA method to reduce the phase
space of proteins for long time molecular dynamics. For this
purpose one may determine a small number of important modes
by PCA and project the equations of motion on the resulting
low-dimensional vector space. The idea has been successfully
applied to reduce the dynamics of hydrodynamic systems.
16
In
this case a partial differential equation is reduced to a set of a
small number of ordinary differential equations. This success
strengthens one’s expectation that PCA could serve as a useful
method for protein dynamics reduction.
The purpose of this paper is to demonstrate that PCA is not
suitable for long time protein dynamics reduction with the
computational power available at present or in the foreseeable
future. The PCA method works superbly in hydrodynamics,
because in this case the computationally obtained correlation
function is very accurate due to a sufficiently long sampling
time. In contrast, PCA for proteins is problematic, since the
longest relaxation time τ for many proteins is on the order of
microseconds or longer.
17
The error in the correlation matrix
for a mode of time scale τ is on the order
x
τ/t
tot
, where t
tot
is
the sampling time. Due to these errors in slow modes, the fast
modes, for which
x
τ/t
tot
is very small, are contaminated and
also become unreliable. The difficulties due to the shortness
of simulation have been clearly pointed out by Hodel et al.
18
in
the context of free energy calculations and by Clarage et al.
19
and Faure et al.
20
in the context of incoherent light scattering
experiments. In a comparison of light-scattering patterns derived
from PCA and normal mode analysis, the latter method gives
better agreement with experiment.
20
Our aim here is to shed
light on the difficulties of PCA to extract essential modes from
a MD simulation, which can be employed for long time
descriptions of proteins.
How can we reconcile the difficulties of PCA observed in
this paper and apparent successes in analyzing protein dynamics
reported by other authors?
11-15
If a particular large-scale motion
arises in a short simulation, PCA surely captures the feature.
However, such features may be detectable just as well without
PCA, e.g., through atomic root mean square (rms) fluctuations.
A nontrivial use of PCA has been successful in describing the
fast motions relevant to NMR spectroscopy.
21
As will be
discussed later, however, this success does not justify the
application of PCA to describe long time dynamics of proteins.
In section II we will outline PCA and its difficulties in
conjunction with long time dynamics. In section III, the
application of the PCA method is illustrated through simulations
of G-actin. In section IV we explain analytically why PCA
does not yield reliable slow modes. Section V provides
concluding remarks.
II. Principal Component Analysis and Its Difficulty
Let qb(t) ) (q
1
(t),q
2
(t),...,q
3N
(t))′ be the coordinates of N atoms
in a protein at time t, where ′ denotes transposition. The basic
idea of PCA is to seek the least-squares approximant for the
position vector in the form
where M is the number of relevant degrees of freedom and the
vectors ηb
R
are orthonormal basis vectors, which are determined
by the eigenvalue problem
Here, K
T
is the correlation matrix determined by the molecular
dynamics trajectory qb(t):
X
Abstract published in AdVance ACS Abstracts, February 1, 1996.
qb(t) =
∑
R)1
M
a
R
(t)ηb
R
(2.1)
λ
R
ηb
R
) K
T
ηb
R
(2.2)
2567J. Phys. Chem. 1996, 100, 2567-2572
0022-3654/96/20100-2567$12.00/0 1996 American Chemical Society
+ +
+ +
M. A. Balsera, W. Wriggers, Y. Oono, and K. Schulten*
Department of Physics and Beckman Institute, UniVersity of Illinois at Urbana-Champaign,
405 North Mathews AVenue, Urbana, Illinois 61801
ReceiVed: December 12, 1995; In Final Form: December 21, 1995
X
It has been suggested that principal component analysis can identify slow modes in proteins and, thereby,
facilitate the study of long time dynamics. However, sampling errors due to finite simulation times preclude
the identification of slow modes that can be used for this purpose. This is demonstrated numerically with the
aid of simulations of the protein G-actin and analytically with the aid of a model which is exactly recoverable
by principal component analysis. Although principal component analysis usually demonstrates that a set of
a small number of modes captures the majority of the fluctuations, the set depends on the particular sampling
time window and its width.
I. Introduction
Molecular dynamics has become an important tool to study
proteins.
1,2
One of the major obstacles limiting its usefulness
is the shortness of achievable simulation times. These times,
on the order of a few nanoseconds, are much shorter than the
time scales of many important protein processes, e.g., folding.
2
Many attempts have been made (e.g., refs 3, 4) to extend the
time scale of molecular dynamics simulations, but as yet, no
satisfactory solution has been found.
One possible approach to reduce the number of degrees of
freedom is principal component analysis (PCA), which is also
known as the Karhunen-Loeve expansion
5
in time series
analysis. A detailed implementation of the method in the current
context can be found in ref 6, and its use has been recently
reviewed in ref 7. This method was introduced to the protein
research community under the name “quasi-harmonic analysis”
by Karplus and co-workers
8-10
to facilitate the computation of
configurational entropies. The method has also been employed
with the hope of describing molecular dynamics trajectories in
terms of a small number of variables, sometimes called essential
degrees of freedom.
11-15
It appears natural to use the PCA method to reduce the phase
space of proteins for long time molecular dynamics. For this
purpose one may determine a small number of important modes
by PCA and project the equations of motion on the resulting
low-dimensional vector space. The idea has been successfully
applied to reduce the dynamics of hydrodynamic systems.
16
In
this case a partial differential equation is reduced to a set of a
small number of ordinary differential equations. This success
strengthens one’s expectation that PCA could serve as a useful
method for protein dynamics reduction.
The purpose of this paper is to demonstrate that PCA is not
suitable for long time protein dynamics reduction with the
computational power available at present or in the foreseeable
future. The PCA method works superbly in hydrodynamics,
because in this case the computationally obtained correlation
function is very accurate due to a sufficiently long sampling
time. In contrast, PCA for proteins is problematic, since the
longest relaxation time τ for many proteins is on the order of
microseconds or longer.
17
The error in the correlation matrix
for a mode of time scale τ is on the order
x
τ/t
tot
, where t
tot
is
the sampling time. Due to these errors in slow modes, the fast
modes, for which
x
τ/t
tot
is very small, are contaminated and
also become unreliable. The difficulties due to the shortness
of simulation have been clearly pointed out by Hodel et al.
18
in
the context of free energy calculations and by Clarage et al.
19
and Faure et al.
20
in the context of incoherent light scattering
experiments. In a comparison of light-scattering patterns derived
from PCA and normal mode analysis, the latter method gives
better agreement with experiment.
20
Our aim here is to shed
light on the difficulties of PCA to extract essential modes from
a MD simulation, which can be employed for long time
descriptions of proteins.
How can we reconcile the difficulties of PCA observed in
this paper and apparent successes in analyzing protein dynamics
reported by other authors?
11-15
If a particular large-scale motion
arises in a short simulation, PCA surely captures the feature.
However, such features may be detectable just as well without
PCA, e.g., through atomic root mean square (rms) fluctuations.
A nontrivial use of PCA has been successful in describing the
fast motions relevant to NMR spectroscopy.
21
As will be
discussed later, however, this success does not justify the
application of PCA to describe long time dynamics of proteins.
In section II we will outline PCA and its difficulties in
conjunction with long time dynamics. In section III, the
application of the PCA method is illustrated through simulations
of G-actin. In section IV we explain analytically why PCA
does not yield reliable slow modes. Section V provides
concluding remarks.
II. Principal Component Analysis and Its Difficulty
Let qb(t) ) (q
1
(t),q
2
(t),...,q
3N
(t))′ be the coordinates of N atoms
in a protein at time t, where ′ denotes transposition. The basic
idea of PCA is to seek the least-squares approximant for the
position vector in the form
where M is the number of relevant degrees of freedom and the
vectors ηb
R
are orthonormal basis vectors, which are determined
by the eigenvalue problem
Here, K
T
is the correlation matrix determined by the molecular
dynamics trajectory qb(t):
X
Abstract published in AdVance ACS Abstracts, February 1, 1996.
qb(t) =
∑
R)1
M
a
R
(t)ηb
R
(2.1)
λ
R
ηb
R
) K
T
ηb
R
(2.2)
2567J. Phys. Chem. 1996, 100, 2567-2572
0022-3654/96/20100-2567$12.00/0 1996 American Chemical Society
+ +
+ +
Page 2
where 〈 〉
t
tot
denotes the time average over 0 e t e t
tot
. The
eigenvalue λ
R
is the time average of |a
R
(t)|
2
.
The hope is to obtain a set of equations of motion for a small
number of modes a
R
(t) corresponding to the largest eigenvalues
λ
R
which provide a reduced representation of the long time
dynamics of a protein. If one knew the correct basis vectors
ηb
R
, then one could obtain the equations with the aid of the
(nonlinear) Galerkin
22
method, as done in fluid dynamics.
16
However, the shortness of the sampling time prevents the
identification of suitable basis vectors ηb
R
. The longest time
scale τ of a protein scales as τ ∼ n
â
, where n is the number of
amino acid residues. The exponent is of order unity. It cannot
be shorter than 2/3, i.e., the exponent which applies to diffusive
relaxation of modes in a solid body. Probably the upper bound
is given by a single-chain polymer in a good solvent with â =
1.67,
23
which may be interpreted as the fluctuation time scale
of the completely unfolded protein conformation. According
to NMR studies, the longest relaxation time cannot be smaller
than 1 µs for a protein of the size n ≈ 300.
17
Thus, for many
proteins the necessary sampling time must be extremely long.
The longest simulation time realistically accessible today, even
for a modest sized protein as G-actin, is a few nanoseconds.
Thus, slow motions cannot be properly identified.
Even if we could manage to obtain the data for at least one
relaxation time of the slowest mode, the existence of fluctuations
prohibits an accurate determination of the correlations between
the slow modes. Consequently, one cannot obtain reliable
collective coordinates for them and, thus, one cannot use PCA
to reduce protein dynamics without spoiling the most important
slow motion.
There remains some hope that the PCA method, though not
capable of extracting long time scale modes, may capture
properly the modes with relaxation times well within the
simulation time window. We will show below that this is not
the case, either.
III. Principal Component Analysis of G-Actin
To illustrate the difficulties one encounters in the application
of PCA to protein dynamics, we simulate the protein G-actin
(globular actin). G-actin is the monomeric form of the
polymeric protein system F-actin (filamentous actin), which is
an important component of the cell cytoskeleton, where it is
involved in cellular motility.
24-26
F-actin also forms the major
component of the thin filament of muscle tissue.
27,28
The crystal
structure of G-actin has been resolved by Kabsch and co-workers
in 1990.
29
G-actin’s 375 amino acid residues are organized in
four subdomains, and in the present study the whole protein
was simulated with explicit water molecules, a bound adenosine
diphosphate nucleotide, and an associated calcium ion. It should
be noted that a normal mode analysis of G-actin has been carried
out by Tirion and ben-Avraham.
30
The resulting modes have
been used to reduce the search span for building F-actin
polymers from G-actin monomers.
31
The molecular dynamics simulations were carried out with
X-PLOR,
32
using a 12 cutoff and an all-hydrogen force field,
a 1 fs integration time step, and a dielectric constant ² ) 1.
The protein was solvated in a 5.6 shell of explicit water
molecules (1189 water molecules). The total system size was
9441 atoms. The system was first energy-minimized, then
assigned initial atom velocities according to a Maxwell distribu-
tion, heated up to 300 K in steps of 30 K in a 5 ps time period,
and equilibrated at 300 K for 5 ps. Finally, free molecular
dynamics was performed for 490 ps.
The trajectories have been analyzed after 20 ps of relaxation
in the form of two consecutive periods of 235 ps. Coordinates
were sampled at a 0.1 ps interval. The simulation results will
be described elsewhere.
33
We sampled in our principal component analysis only the
positions of the 375 C
R
carbons resulting in 3N ) 1275 degrees
of freedom. Amadei et al.
13,34
demonstrated that the character
of the fluctuation spectrum and even the identity of the larger
amplitude modes do not change significantly by a reduction to
C
R
atoms.
We subtracted, as done customarily in PCA, the translational
and rotational modes of the whole protein. These modes are
conserved for the whole system and are expected to have very
long relaxation times for the reduced set of variables we
consider. The stated subtraction endows the covariance matrix
with six vanishing eigenvalues associated with the conserved
modes. The eigenvalues are actually not zero, since the
subtraction of the rigid-body modes only involves the C
R
values,
but are much smaller than the next smallest eigenvalues. The
former eigenvalues are on the order of λ ≈ 10
-8
2
, while the
latter eigenvalues measure λ ≈ 10
-4
2
.
Figures 1-3 present the projections a
R
(t) of the trajectory
onto the eigenvectors corresponding to the three largest eigen-
values λ
R
of the correlation matrix. The results demonstrate
that motion along these modes exhibits a relaxation time which
is longer than or comparable to the sampling window of 235
ps. Notice that if the relaxation time of a mode is significantly
longer than the window width, its relaxation cannot be observed,
and the mode does not show up in the PCA result. Conse-
quently, significant modes reflecting the long term dynamics
could be missed.
K
T
) 〈(qb(t) - 〈qb(t)〉
t
tot
)(qb(t) - 〈qb(t)〉
t
tot
)′〉
t
tot
(2.3)
Figure 1. Projection of the trajectory along the largest principal
component and its autocorrelation function.
2568 J. Phys. Chem., Vol. 100, No. 7, 1996 Balsera et al.
+ +
+ +
t
tot
denotes the time average over 0 e t e t
tot
. The
eigenvalue λ
R
is the time average of |a
R
(t)|
2
.
The hope is to obtain a set of equations of motion for a small
number of modes a
R
(t) corresponding to the largest eigenvalues
λ
R
which provide a reduced representation of the long time
dynamics of a protein. If one knew the correct basis vectors
ηb
R
, then one could obtain the equations with the aid of the
(nonlinear) Galerkin
22
method, as done in fluid dynamics.
16
However, the shortness of the sampling time prevents the
identification of suitable basis vectors ηb
R
. The longest time
scale τ of a protein scales as τ ∼ n
â
, where n is the number of
amino acid residues. The exponent is of order unity. It cannot
be shorter than 2/3, i.e., the exponent which applies to diffusive
relaxation of modes in a solid body. Probably the upper bound
is given by a single-chain polymer in a good solvent with â =
1.67,
23
which may be interpreted as the fluctuation time scale
of the completely unfolded protein conformation. According
to NMR studies, the longest relaxation time cannot be smaller
than 1 µs for a protein of the size n ≈ 300.
17
Thus, for many
proteins the necessary sampling time must be extremely long.
The longest simulation time realistically accessible today, even
for a modest sized protein as G-actin, is a few nanoseconds.
Thus, slow motions cannot be properly identified.
Even if we could manage to obtain the data for at least one
relaxation time of the slowest mode, the existence of fluctuations
prohibits an accurate determination of the correlations between
the slow modes. Consequently, one cannot obtain reliable
collective coordinates for them and, thus, one cannot use PCA
to reduce protein dynamics without spoiling the most important
slow motion.
There remains some hope that the PCA method, though not
capable of extracting long time scale modes, may capture
properly the modes with relaxation times well within the
simulation time window. We will show below that this is not
the case, either.
III. Principal Component Analysis of G-Actin
To illustrate the difficulties one encounters in the application
of PCA to protein dynamics, we simulate the protein G-actin
(globular actin). G-actin is the monomeric form of the
polymeric protein system F-actin (filamentous actin), which is
an important component of the cell cytoskeleton, where it is
involved in cellular motility.
24-26
F-actin also forms the major
component of the thin filament of muscle tissue.
27,28
The crystal
structure of G-actin has been resolved by Kabsch and co-workers
in 1990.
29
G-actin’s 375 amino acid residues are organized in
four subdomains, and in the present study the whole protein
was simulated with explicit water molecules, a bound adenosine
diphosphate nucleotide, and an associated calcium ion. It should
be noted that a normal mode analysis of G-actin has been carried
out by Tirion and ben-Avraham.
30
The resulting modes have
been used to reduce the search span for building F-actin
polymers from G-actin monomers.
31
The molecular dynamics simulations were carried out with
X-PLOR,
32
using a 12 cutoff and an all-hydrogen force field,
a 1 fs integration time step, and a dielectric constant ² ) 1.
The protein was solvated in a 5.6 shell of explicit water
molecules (1189 water molecules). The total system size was
9441 atoms. The system was first energy-minimized, then
assigned initial atom velocities according to a Maxwell distribu-
tion, heated up to 300 K in steps of 30 K in a 5 ps time period,
and equilibrated at 300 K for 5 ps. Finally, free molecular
dynamics was performed for 490 ps.
The trajectories have been analyzed after 20 ps of relaxation
in the form of two consecutive periods of 235 ps. Coordinates
were sampled at a 0.1 ps interval. The simulation results will
be described elsewhere.
33
We sampled in our principal component analysis only the
positions of the 375 C
R
carbons resulting in 3N ) 1275 degrees
of freedom. Amadei et al.
13,34
demonstrated that the character
of the fluctuation spectrum and even the identity of the larger
amplitude modes do not change significantly by a reduction to
C
R
atoms.
We subtracted, as done customarily in PCA, the translational
and rotational modes of the whole protein. These modes are
conserved for the whole system and are expected to have very
long relaxation times for the reduced set of variables we
consider. The stated subtraction endows the covariance matrix
with six vanishing eigenvalues associated with the conserved
modes. The eigenvalues are actually not zero, since the
subtraction of the rigid-body modes only involves the C
R
values,
but are much smaller than the next smallest eigenvalues. The
former eigenvalues are on the order of λ ≈ 10
-8
2
, while the
latter eigenvalues measure λ ≈ 10
-4
2
.
Figures 1-3 present the projections a
R
(t) of the trajectory
onto the eigenvectors corresponding to the three largest eigen-
values λ
R
of the correlation matrix. The results demonstrate
that motion along these modes exhibits a relaxation time which
is longer than or comparable to the sampling window of 235
ps. Notice that if the relaxation time of a mode is significantly
longer than the window width, its relaxation cannot be observed,
and the mode does not show up in the PCA result. Conse-
quently, significant modes reflecting the long term dynamics
could be missed.
K
T
) 〈(qb(t) - 〈qb(t)〉
t
tot
)(qb(t) - 〈qb(t)〉
t
tot
)′〉
t
tot
(2.3)
Figure 1. Projection of the trajectory along the largest principal
component and its autocorrelation function.
2568 J. Phys. Chem., Vol. 100, No. 7, 1996 Balsera et al.
+ +
+ +
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime


