Lengthscales and Cooperativity in DNA Bubble Formation
Europhysics Letters (2005)
- DOI: 10.1209/epl/i2005-10543-x
- arXiv: cond-mat/0511128
Available from arxiv.org
or
Abstract
It appears that thermally activated DNA bubbles of different sizes play central roles in important genetic processes. Here we show that the probability for the formation of such bubbles is regulated by the number of soft AT pairs in specific regions with lengths which at physiological temperatures are of the order of (but not equal to) the size of the bubble. The analysis is based on the Peyrard- Bishop-Dauxois model, whose equilibrium statistical properties have been accurately calculated here with a transfer integral approach.
Available from arxiv.org
Page 1
Lengthscales and Cooperativity in DNA Bubble Formation
ar
X
iv
:c
on
d-
m
at
/0
51
11
28
v1
[
co
nd
-m
at.
so
ft]
4
N
ov
20
05
Lengthscales and Cooperativity in DNA Bubble Formation
Z. Rapti1, A. Smerzi2,3, K. Ø. Rasmussen2 and A. R. Bishop2
1 Center for Nonlinear Studies, Los Alamos National Laboratory,
Los Alamos, New Mexico 87545, and School of Mathematics,
Institute for Advanced Study, Einstein Drive 1, Princeton, New Jersey 08540
2 Theoretical Division and Center for Nonlinear Studies,
Los Alamos National Laboratory, Los Alamos, New Mexico 87545 and
3 Istituto Nazionale di Fisica per la Materia BEC-CRS, Universita` di Trento, I-38050 Povo, Italy
C.H. Choi, and A. Usheva
Beth Israel Deaconess Medical Center and Harvard Medical School,
99 Brookline Avenue, Boston, Massachusetts 02215
(Dated: February 2, 2008)
It appears that thermally activated DNA bubbles of different sizes play central roles in important
genetic processes. Here we show that the probability for the formation of such bubbles is regulated
by the number of soft AT pairs in specific regions with lengths which at physiological temperatures
are of the order of (but not equal to) the size of the bubble. The analysis is based on the Peyrard-
Bishop-Dauxois model, whose equilibrium statistical properties have been accurately calculated here
with a transfer integral approach.
The genetic code underlying all forms of life is encoded
in the DNA molecule by the four bases guanine (G),
thymine (T), adenine (A), and cytosine (C) strung along
a sugar-phosphate backbone in a particular sequence.
The four bases are, through hydrogen bonding, pair-
wise complementary (A-T and G-C) allowing the coding
strand and it’s complement to form the characteristic
double helical DNA macromolecule. Although this con-
struct is extraordinarily stable, it is clearly necessary that
the double strands be separated in biological processes,
including gene transcription, where the code is read by
the appropriate protein machinery in the cell. It has
long been an experimental fact [1] that the DNA double-
strand can be thermally destabilized locally to form tem-
porary single stranded “bubbles” in the molecule. This
local melting is made possible by the entropy gained by
transitioning from the very rigid double-strand to the
much more flexible single-strand, which already at biolog-
ically relevant temperatures can balance the energy cost
of breaking a few base pairs. Considering this entropic
effect together with the inherent energetic heterogeneity
– GC base pairs are 25 % more strongly bound than the
AT bases – of a DNA sequence, it is conceivable that
certain regions (subsequences) are more prone to such
thermal destabilization than others: This has been con-
firmed by model calculations as well as experiments. We
have previously argued [2] that such regions may indeed
experimentally coincide with transcription initiation and
regulatory sites. In this way, the DNA molecule may help
initiate its own transcription by containing bubble form-
ing subsequences at the crucial positions in the sequence
where the transcription machinery assembles and engages
its operation. If a robust general link between the forma-
tion of large thermal bubbles and transcription initiation
is sufficiently established, it becomes crucially important
to be able to accurately predict the subsequence of DNA
with propensity for the formation of bubbles of appropri-
ate sizes.
Here we show that the probabilities of finding bubbles
extending over n sites do not depend on a specific DNA
subsequences. Rather, such probabilities depend on the
density of soft A/T base pairs within specific regions of
length L(κ). This characteristic length is of the order of
the size n of the bubble at physiological temperatures,
but it diverges as the DNA melting temperature is ap-
proached. Our results are based on a calculation of the
thermal equilibrium statistical properties of the Peyrard-
Bishop-Dauxois (PBD) model [3, 4] using a transfer inte-
gral operator (TIO) technique. This model constitutes a
very powerful tool to not only predict bubble formation
probability in a given sequence but also to understand the
underlying physical mechanisms [5]. Our previous study
of the PBD model has been performed using Langevin
[2, 6] and Monte Carlo techniques [7]. However, since our
interest is centered on a very small portion of the thermo-
dynamical equilibrium state, namely on the formation of
large bubbles, dynamical and iterative samplings as of-
fered by these methods are not very efficient. Therefore,
we have developed here a semi-analytic approach based
on the TIO [8, 9] that allows us to efficiently calculate
relevant thermodynamical probabilities.
The potential energy of the PBD model, in its simplest
form, reads
E =
N
∑
κ=1
[V (yn) +W (yn, yn−1)] =
N
∑
κ=1
E(yn, yn−1), (1)
where V (yn) = Dn(e−anyn − 1)2, represents the nonlin-
ear hydrogen bonds between the bases. W (yn, yn−1) =
k
2
(
1 + ρe−b(yn+yn−1)
)
(yn − yn−1)2 is the nearest-
neighbor coupling that represents the (nonlinear) stack-
X
iv
:c
on
d-
m
at
/0
51
11
28
v1
[
co
nd
-m
at.
so
ft]
4
N
ov
20
05
Lengthscales and Cooperativity in DNA Bubble Formation
Z. Rapti1, A. Smerzi2,3, K. Ø. Rasmussen2 and A. R. Bishop2
1 Center for Nonlinear Studies, Los Alamos National Laboratory,
Los Alamos, New Mexico 87545, and School of Mathematics,
Institute for Advanced Study, Einstein Drive 1, Princeton, New Jersey 08540
2 Theoretical Division and Center for Nonlinear Studies,
Los Alamos National Laboratory, Los Alamos, New Mexico 87545 and
3 Istituto Nazionale di Fisica per la Materia BEC-CRS, Universita` di Trento, I-38050 Povo, Italy
C.H. Choi, and A. Usheva
Beth Israel Deaconess Medical Center and Harvard Medical School,
99 Brookline Avenue, Boston, Massachusetts 02215
(Dated: February 2, 2008)
It appears that thermally activated DNA bubbles of different sizes play central roles in important
genetic processes. Here we show that the probability for the formation of such bubbles is regulated
by the number of soft AT pairs in specific regions with lengths which at physiological temperatures
are of the order of (but not equal to) the size of the bubble. The analysis is based on the Peyrard-
Bishop-Dauxois model, whose equilibrium statistical properties have been accurately calculated here
with a transfer integral approach.
The genetic code underlying all forms of life is encoded
in the DNA molecule by the four bases guanine (G),
thymine (T), adenine (A), and cytosine (C) strung along
a sugar-phosphate backbone in a particular sequence.
The four bases are, through hydrogen bonding, pair-
wise complementary (A-T and G-C) allowing the coding
strand and it’s complement to form the characteristic
double helical DNA macromolecule. Although this con-
struct is extraordinarily stable, it is clearly necessary that
the double strands be separated in biological processes,
including gene transcription, where the code is read by
the appropriate protein machinery in the cell. It has
long been an experimental fact [1] that the DNA double-
strand can be thermally destabilized locally to form tem-
porary single stranded “bubbles” in the molecule. This
local melting is made possible by the entropy gained by
transitioning from the very rigid double-strand to the
much more flexible single-strand, which already at biolog-
ically relevant temperatures can balance the energy cost
of breaking a few base pairs. Considering this entropic
effect together with the inherent energetic heterogeneity
– GC base pairs are 25 % more strongly bound than the
AT bases – of a DNA sequence, it is conceivable that
certain regions (subsequences) are more prone to such
thermal destabilization than others: This has been con-
firmed by model calculations as well as experiments. We
have previously argued [2] that such regions may indeed
experimentally coincide with transcription initiation and
regulatory sites. In this way, the DNA molecule may help
initiate its own transcription by containing bubble form-
ing subsequences at the crucial positions in the sequence
where the transcription machinery assembles and engages
its operation. If a robust general link between the forma-
tion of large thermal bubbles and transcription initiation
is sufficiently established, it becomes crucially important
to be able to accurately predict the subsequence of DNA
with propensity for the formation of bubbles of appropri-
ate sizes.
Here we show that the probabilities of finding bubbles
extending over n sites do not depend on a specific DNA
subsequences. Rather, such probabilities depend on the
density of soft A/T base pairs within specific regions of
length L(κ). This characteristic length is of the order of
the size n of the bubble at physiological temperatures,
but it diverges as the DNA melting temperature is ap-
proached. Our results are based on a calculation of the
thermal equilibrium statistical properties of the Peyrard-
Bishop-Dauxois (PBD) model [3, 4] using a transfer inte-
gral operator (TIO) technique. This model constitutes a
very powerful tool to not only predict bubble formation
probability in a given sequence but also to understand the
underlying physical mechanisms [5]. Our previous study
of the PBD model has been performed using Langevin
[2, 6] and Monte Carlo techniques [7]. However, since our
interest is centered on a very small portion of the thermo-
dynamical equilibrium state, namely on the formation of
large bubbles, dynamical and iterative samplings as of-
fered by these methods are not very efficient. Therefore,
we have developed here a semi-analytic approach based
on the TIO [8, 9] that allows us to efficiently calculate
relevant thermodynamical probabilities.
The potential energy of the PBD model, in its simplest
form, reads
E =
N
∑
κ=1
[V (yn) +W (yn, yn−1)] =
N
∑
κ=1
E(yn, yn−1), (1)
where V (yn) = Dn(e−anyn − 1)2, represents the nonlin-
ear hydrogen bonds between the bases. W (yn, yn−1) =
k
2
(
1 + ρe−b(yn+yn−1)
)
(yn − yn−1)2 is the nearest-
neighbor coupling that represents the (nonlinear) stack-
Page 2
2ing interaction between adjacent base pairs: it is com-
prised of a harmonic coupling with a state depended cou-
pling constant effectively modeling the change in stiffness
as the double strand is opened (i.e. entropic effects).
This nonlinear coupling results in long-range cooperative
effects, leading to a sharp entropy-driven denaturation
transition [3, 10]. The sum in Eq.(1) is over all base-pairs
of the molecule and yn denotes the relative displacement
from equilibrium bases at the nth base pair. The im-
portance of the heterogeneity of the sequence is incorpo-
rated by assigning different values to the parameters of
the Morse potential, depending on the the base-pair type.
The parameter values we have used are those from Refs.
[11, 12] chosen to reproduce a variety of thermodynamic
properties.
Transfer Integral Method. All equilibrium, thermo-
dynamic properties of the model (1) can be obtained
through the partition function
Z =
∫ N
∏
κ=1
dyne−βE(yn,yn−1)
=
∫ s+κ−1
∏
n=s
dynZκ(s) e−βE(yn,yn−1), (2)
where the notation
Zκ(s) =
∫ N
∏
n6=s,...,s+κ−1
dyn e−βE(yn,yn−1)
has been introduced. β = (kBT )−1 is the Boltzmann
factor. In order to evaluate the partition function (2)
using the TIO method, we first symmetrize e−βE(x,y) by
introducing [10]
S(x, y) = exp
(
−β
2
(V (x) + V (y) + 2W (x, y))
)
= S(y, x).
Here the second equality holds only when x and y corre-
spond to base-pairs of the same kind. Using Eq. (2) the
expression for Zκ(s) is rewritten as
Zκ(s) =
∫
N
∏
n6=s,...,s+κ−1
dynS(yn, yn−1)
×dy0e−
β
2 V (y1)e−
β
2 V (yN), (3)
where open boundary conditions at n = 1, and n = N
have been used. To proceed, a Fredholm integral equa-
tions with a real symmetric kernel
∫
dyS(x, y)φ(y) = λφ(x) (4)
must be solved separately for the A/T and for the G/C
base-pairs.
Since the eigenvalues are orthonormal and the eigen-
functions form a complete basis, Eq.(4) can be used se-
quentially to replace all integrals by matrix multiplica-
tions in Eq. (3). Whenever the sequence heterogeneity
results in a non-symmetric S(x, y), Eq.(4) cannot be used
and we resort to a symmetrization technique, based on
successive introduction of auxiliary integration variables,
as explained in Ref. [13].
As noted, in order to quantify the sequence depen-
dence on DNA’s ability to form bubbles of different sizes,
we have previously monitored the frequency of opening
events using Langevin and Monte Carlo simulation tech-
niques. Since the large openings constitute relatively rare
events such techniques are not efficient (although essen-
tial for evaluating dynamical and non-equilibrium prop-
erties). It is much more effective to imply the proba-
bilities of large bubbles at a given site in the sequence
directly from the thermodynamic distributions using the
TIO. Importantly, we have confirmed below that this
equilibrium approach reproduces the bubble locations ob-
served by Langevin simulations for the same sequences.
This suggests that the bubbles – although large bubbles
are rare events – are governed by equilibrium statistics.
We evaluate the probabilities Pκ(s), for a base-pair
opening spanning κ base-pairs (our operational definition
of a bubble of size κ), starting at base-pair s as
Pκ(s) = Z−1
∫ ∞
t
s+κ−1
∏
n=s
dynZκ(s)e−βE(yn,yn−1), (5)
where t is the separation (which we have taken as 1.5 A˚)
of the double strand above which we define the strand to
be melted.
Numerical Results. Using this technique, we are able
to systematically investigate the relation between a given
sequence containing a (apparently) disordered mixture
of A/T and G/C pairs and the probability of sponta-
neous, thermally activated, bubbles of various sizes. Our
analysis begins with a thorough study of two viral pro-
moter sequences, Adeno major late promoter (AMLP)
and Adeno Associate viral promoter (AAV P5). We have
previously investigated the dependence of the thermally
induced large bubbles in these sequences [2, 6] and found
that the opening profiles obtained through Langevin sim-
ulations of the PBD model agreed remarkably well with
the local denaturation profiles indicated by S1 nuclease
experiments (see Ref. [2] for details). Here we use the
TIO to calculate the probabilities Eq. (5) for the ther-
mal creation of bubbles of size 1,3,7, and 10 base-pairs
for the AdMLP promoter at T = 300 K (Fig. 1). The
significant feature of the sequence is the occurrence of a
TATA-box at base-pair location −30 with 7 consecutive
A/T base-pairs. Around +1 there is rich region contain-
ing ∼ 12 A/T base pairs, which, however, are not located
consecutively since a comparable amount of G/C pairs
are alternately embedded among the A/T pairs. Since
prised of a harmonic coupling with a state depended cou-
pling constant effectively modeling the change in stiffness
as the double strand is opened (i.e. entropic effects).
This nonlinear coupling results in long-range cooperative
effects, leading to a sharp entropy-driven denaturation
transition [3, 10]. The sum in Eq.(1) is over all base-pairs
of the molecule and yn denotes the relative displacement
from equilibrium bases at the nth base pair. The im-
portance of the heterogeneity of the sequence is incorpo-
rated by assigning different values to the parameters of
the Morse potential, depending on the the base-pair type.
The parameter values we have used are those from Refs.
[11, 12] chosen to reproduce a variety of thermodynamic
properties.
Transfer Integral Method. All equilibrium, thermo-
dynamic properties of the model (1) can be obtained
through the partition function
Z =
∫ N
∏
κ=1
dyne−βE(yn,yn−1)
=
∫ s+κ−1
∏
n=s
dynZκ(s) e−βE(yn,yn−1), (2)
where the notation
Zκ(s) =
∫ N
∏
n6=s,...,s+κ−1
dyn e−βE(yn,yn−1)
has been introduced. β = (kBT )−1 is the Boltzmann
factor. In order to evaluate the partition function (2)
using the TIO method, we first symmetrize e−βE(x,y) by
introducing [10]
S(x, y) = exp
(
−β
2
(V (x) + V (y) + 2W (x, y))
)
= S(y, x).
Here the second equality holds only when x and y corre-
spond to base-pairs of the same kind. Using Eq. (2) the
expression for Zκ(s) is rewritten as
Zκ(s) =
∫
N
∏
n6=s,...,s+κ−1
dynS(yn, yn−1)
×dy0e−
β
2 V (y1)e−
β
2 V (yN), (3)
where open boundary conditions at n = 1, and n = N
have been used. To proceed, a Fredholm integral equa-
tions with a real symmetric kernel
∫
dyS(x, y)φ(y) = λφ(x) (4)
must be solved separately for the A/T and for the G/C
base-pairs.
Since the eigenvalues are orthonormal and the eigen-
functions form a complete basis, Eq.(4) can be used se-
quentially to replace all integrals by matrix multiplica-
tions in Eq. (3). Whenever the sequence heterogeneity
results in a non-symmetric S(x, y), Eq.(4) cannot be used
and we resort to a symmetrization technique, based on
successive introduction of auxiliary integration variables,
as explained in Ref. [13].
As noted, in order to quantify the sequence depen-
dence on DNA’s ability to form bubbles of different sizes,
we have previously monitored the frequency of opening
events using Langevin and Monte Carlo simulation tech-
niques. Since the large openings constitute relatively rare
events such techniques are not efficient (although essen-
tial for evaluating dynamical and non-equilibrium prop-
erties). It is much more effective to imply the proba-
bilities of large bubbles at a given site in the sequence
directly from the thermodynamic distributions using the
TIO. Importantly, we have confirmed below that this
equilibrium approach reproduces the bubble locations ob-
served by Langevin simulations for the same sequences.
This suggests that the bubbles – although large bubbles
are rare events – are governed by equilibrium statistics.
We evaluate the probabilities Pκ(s), for a base-pair
opening spanning κ base-pairs (our operational definition
of a bubble of size κ), starting at base-pair s as
Pκ(s) = Z−1
∫ ∞
t
s+κ−1
∏
n=s
dynZκ(s)e−βE(yn,yn−1), (5)
where t is the separation (which we have taken as 1.5 A˚)
of the double strand above which we define the strand to
be melted.
Numerical Results. Using this technique, we are able
to systematically investigate the relation between a given
sequence containing a (apparently) disordered mixture
of A/T and G/C pairs and the probability of sponta-
neous, thermally activated, bubbles of various sizes. Our
analysis begins with a thorough study of two viral pro-
moter sequences, Adeno major late promoter (AMLP)
and Adeno Associate viral promoter (AAV P5). We have
previously investigated the dependence of the thermally
induced large bubbles in these sequences [2, 6] and found
that the opening profiles obtained through Langevin sim-
ulations of the PBD model agreed remarkably well with
the local denaturation profiles indicated by S1 nuclease
experiments (see Ref. [2] for details). Here we use the
TIO to calculate the probabilities Eq. (5) for the ther-
mal creation of bubbles of size 1,3,7, and 10 base-pairs
for the AdMLP promoter at T = 300 K (Fig. 1). The
significant feature of the sequence is the occurrence of a
TATA-box at base-pair location −30 with 7 consecutive
A/T base-pairs. Around +1 there is rich region contain-
ing ∼ 12 A/T base pairs, which, however, are not located
consecutively since a comparable amount of G/C pairs
are alternately embedded among the A/T pairs. Since
Page 3
3−60 −40 −20 0 200
0.02
0.04
bp
P1
a
−60 −40 −20 0 200
0.002
0.004
0.006
0.008
bp
P3
b
−60 −40 −20 0 200
0.0002
0.0004
bp
P7
c
−60 −40 −20 0 200
0.00005
0.0001
bp
P10
d
FIG. 1: Probabilities Pκ of creating bubbles spanning κ =
1, 3, 7, and 10-bp, respectively, for the Adeno major late pro-
moter at T=300K.
A/T base-pairs are more weakly bound (softer) than G/C
pairs, we could reasonably expect that bubbles have a
predominant opening probability in the region -30. This
is indeed the case for small bubbles, Figs. 1a and 1b .
However, surprisingly, this prediction breaks down when
considering bubbles of larger sizes. The corresponding
probability increases around bp +1 (Fig. 1c), up to the
point that, for a bubble of size 10 bp, it becomes the
highest (Fig. 1d). This finding illustrates the strong in-
terplay between the sequence of base-pairs and the size of
the bubble in the thermal activity of DNA (Indeed, it is
likely that bubbles of different sizes may initiate different
genetic processes).
−40 −20 0 200
0.02
0.04
bp
P1
a
−40 −20 0 200
0.002
0.004
0.006
bp
P3
b
−40 −20 0 200
0.001
0.002
bp
P5
c
−40 −20 0 200
0.0001
0.0002
0.0003
bp
P10 d
FIG. 2: Probabilities Pκ of creating bubbles of 1 bp (κ = 1)
length (a), 3 bp (b), 5 bp (c) and 10 bp (d) length, for the
wild (dashed line) and mutant (solid line) P5 promoter.
The replacement of 1 or 2 soft A/T with hard G/C base
pairs in specific regions of the DNA can also hugely affect
the probability for the formation of bubbles of given sizes.
This is illustrated with the AAV P5 promoter (Fig.2).
This sequence regulates the AAV gene expression, and
it has been shown [14] to bind the transcription initia-
tor Yin Yang 1 (YY1) and to be active for TATA-Box
protein (TBP)-independent transcription. The mutation
of this promoter in which the two A/T bases at +1 and
+2 are replaced by two G/C bases, is known to destroy
the binding site for the YY1 initiator and thereby inhibit
transcription. We have previously shown by Langevin
simulations of the PBD model that this mutation also
suppresses the formation of large bubbles around bp +1.
Here we again calculate the probability to obtain bubbles
of various sizes using the TIO. In Fig. 2 we show the
probability of obtaining bubbles of sizes n=1,3,5, and 10
for the wild (dashed line) and the mutated (solid line)
AAV P5 promoter. The mutation causes a dramatic
change in the double strand’s ability to form large bub-
bles at and around the mutated region. However, the
P1 and the P10 probabilities are much less affected by
the mutation. Notice for the wild type P5 AAV pro-
moter, the region around the TATA-box has the largest
probability for forming large bubbles (panel d). It is im-
portant to note that the AAV P5 promoter has four A/T
rich regions: four consecutive A/T’s around position -40:
seven A/T’s around position -30; five A/T’s at the tran-
scription start site +1; six A/T’s around +14, and all
these soft regions are clearly discernible in P1 (panel a).
From these results on the AAV P5 and AdMLP promot-
0 500
0.01
0.02
0.03
bp
P1 a
0 500
0.001
0.002
bp
P4
b
0 500
0.0001
0.0002
0.0003
bp
P7 c
0 500
0.00002
0.00004
0.00006
bp
P10
d
FIG. 3: Probability Pκ for the formation of bubbles of sizes
κ = 1, 4, 7, 10 bps. The sequences are composed of 20 G/C,
5 A/T, 20 G/C followed by different sequences comprising
3,4,5,6,7 A/T alternating with G/C bps. The last 20 bps are
G/C.
ers we can speculate that the occurrence and intensity
of a peak in the bubble probabilities does not depend on
the specific composition of the DNA fragment. Rather,
there is an essential interplay between the content of A/T
0.02
0.04
bp
P1
a
−60 −40 −20 0 200
0.002
0.004
0.006
0.008
bp
P3
b
−60 −40 −20 0 200
0.0002
0.0004
bp
P7
c
−60 −40 −20 0 200
0.00005
0.0001
bp
P10
d
FIG. 1: Probabilities Pκ of creating bubbles spanning κ =
1, 3, 7, and 10-bp, respectively, for the Adeno major late pro-
moter at T=300K.
A/T base-pairs are more weakly bound (softer) than G/C
pairs, we could reasonably expect that bubbles have a
predominant opening probability in the region -30. This
is indeed the case for small bubbles, Figs. 1a and 1b .
However, surprisingly, this prediction breaks down when
considering bubbles of larger sizes. The corresponding
probability increases around bp +1 (Fig. 1c), up to the
point that, for a bubble of size 10 bp, it becomes the
highest (Fig. 1d). This finding illustrates the strong in-
terplay between the sequence of base-pairs and the size of
the bubble in the thermal activity of DNA (Indeed, it is
likely that bubbles of different sizes may initiate different
genetic processes).
−40 −20 0 200
0.02
0.04
bp
P1
a
−40 −20 0 200
0.002
0.004
0.006
bp
P3
b
−40 −20 0 200
0.001
0.002
bp
P5
c
−40 −20 0 200
0.0001
0.0002
0.0003
bp
P10 d
FIG. 2: Probabilities Pκ of creating bubbles of 1 bp (κ = 1)
length (a), 3 bp (b), 5 bp (c) and 10 bp (d) length, for the
wild (dashed line) and mutant (solid line) P5 promoter.
The replacement of 1 or 2 soft A/T with hard G/C base
pairs in specific regions of the DNA can also hugely affect
the probability for the formation of bubbles of given sizes.
This is illustrated with the AAV P5 promoter (Fig.2).
This sequence regulates the AAV gene expression, and
it has been shown [14] to bind the transcription initia-
tor Yin Yang 1 (YY1) and to be active for TATA-Box
protein (TBP)-independent transcription. The mutation
of this promoter in which the two A/T bases at +1 and
+2 are replaced by two G/C bases, is known to destroy
the binding site for the YY1 initiator and thereby inhibit
transcription. We have previously shown by Langevin
simulations of the PBD model that this mutation also
suppresses the formation of large bubbles around bp +1.
Here we again calculate the probability to obtain bubbles
of various sizes using the TIO. In Fig. 2 we show the
probability of obtaining bubbles of sizes n=1,3,5, and 10
for the wild (dashed line) and the mutated (solid line)
AAV P5 promoter. The mutation causes a dramatic
change in the double strand’s ability to form large bub-
bles at and around the mutated region. However, the
P1 and the P10 probabilities are much less affected by
the mutation. Notice for the wild type P5 AAV pro-
moter, the region around the TATA-box has the largest
probability for forming large bubbles (panel d). It is im-
portant to note that the AAV P5 promoter has four A/T
rich regions: four consecutive A/T’s around position -40:
seven A/T’s around position -30; five A/T’s at the tran-
scription start site +1; six A/T’s around +14, and all
these soft regions are clearly discernible in P1 (panel a).
From these results on the AAV P5 and AdMLP promot-
0 500
0.01
0.02
0.03
bp
P1 a
0 500
0.001
0.002
bp
P4
b
0 500
0.0001
0.0002
0.0003
bp
P7 c
0 500
0.00002
0.00004
0.00006
bp
P10
d
FIG. 3: Probability Pκ for the formation of bubbles of sizes
κ = 1, 4, 7, 10 bps. The sequences are composed of 20 G/C,
5 A/T, 20 G/C followed by different sequences comprising
3,4,5,6,7 A/T alternating with G/C bps. The last 20 bps are
G/C.
ers we can speculate that the occurrence and intensity
of a peak in the bubble probabilities does not depend on
the specific composition of the DNA fragment. Rather,
there is an essential interplay between the content of A/T
Page 4
4and G/C base pairs and the size of the bubble being ex-
amined. Understanding the mechanisms regulating the
DNA openings are of great importance for predicting and
engineering DNA processes, and we therefore now con-
sider a series of simple (but experimentally realizable)
DNA sequences where the effects discussed above are re-
produced in detail. Our purpose is to isolate the under-
lying mechanisms. Our five sequences are all composed
of 20 G/C, 5 A/T and 20 G/C base pairs. This is fol-
lowed by a sequence that alternates A/T and G/C base
pairs. We use 3, 4, 5, 6, and 7 A/T base-pairs in the five
sequences. Finally, all five sequences are terminated with
20 G/C base-pairs.
As shown in Fig. 3a, the largest 1 bp opening prob-
ability is localized at +20, a region that contains five
consecutive A/T bases, and is therefore expected to be
more susceptible to open than the region localized around
+50, containing A/T’s alternating with G/C’s. However,
this simple picture changes dramatically as we move to
larger bubbles, Figs. 3b, 3c, and 3d. In all these cases,
the height of the second peak increases as compared to
the peak at +20. With 3 and 4 A/T’s, the peaks saturate
at a value lower than the first peak. However, the height
of the two peaks for the sequence with 5 A/T’s, becomes
equal in P7, and remains so for larger bubbles. The se-
quence with 6 A/T’s shows an inversion of the opening
probability, similar to that observed in the AdMLP se-
quence: at P10 the most probable 10 base-pair opening
occurs around the base-pair location 50. These data in-
dicate that the opening probability of a bubble of a given
length does not trivially depend on the number of consec-
utive A/T’s in the DNA sub-sequence. Instead, bubbles
of sizes n form with higher probabilities in regions where
the number of A/Ts over some characteristic length L(κ)
is higher, even if the A/Ts are mixed with G/C pairs.
To confirm this hypothesis we have extracted the char-
acteristic lengths L(κ) as a function of the bubble size n
from the probability distributions of the simple sequences
considered in Fig. 3. For instance, for κ = 1, Fig. 3a , we
have obtained L(1) = 4 sites, while for κ = 5, Fig. 3b,
we have L(5) = 10 sites. We have therefore considered
the AAV P5 promoter DNA sequence of Fig. 2. Start-
ing from each site s of the sequence, we count the num-
ber Nκ(s) of A/T pairs contained over the corresponding
next L(κ) sites. In Fig. 4a we show the results for bub-
bles of size κ = 1, which can be compared with Fig. 2a
. The small difference between the mutant and the wild
opening probability for the sites around located at 0 is
well reproduced. The difference between the wild and
the mutant sequence is most pronounced for κ = 5, Fig.
4b, which is also in agreement with the TIO calculation
shown in Fig. 2c.
We have also demonstrated that the opening probabil-
ity does not depend on the specific distribution of the
AT pairs contained in the characteristic regions of length
L(κ). This is shown in Fig.5, where we have calculated
−40 −20 0 200
2
4
bp
N1
a
−40 −20 0 20
4
6
8
bp
N5
b
FIG. 4: Number of A/T bps contained in the characteristic
length L(κ), where κ = 1 (top panel), and 5 (bottom panel)
for the wild (dashed line) and mutant (solid line) P5 promoter.
0 50 100 1400
0.00001
0.00002
0.00003
0.00004
0.00005
0.00006
0.00007
bp
P10
FIG. 5: Probability P10 for the formation of a bubble of 10
consecutive bps. The sequences consist of 40 G/C, 5 A/T
and 20 G/C followed by 14 bps containing different random
combinations of 7 A/T and 7 G/C. All sequences end with 40
G/C bps.
with the TIO the probability for formation of bubbles
with size P10 for a sequence where the second A/T rich re-
gions always contains 7 A/T distributed in several differ-
ent combinations, but always over a 20-base region. We
see that independently of the distribution of A/T base-
pairs, the probability of 10 base-pair bubbles is always
largest in the right-most region. Physically, we interpret
this as the nonlinear coherence dominating (smoothing
out the effect of) the base-pair disorder.
In summary, we have developed a semi-analytical tech-
niques (TIO) that allows the efficient prediction of a given
sequence for thermally induced bubbles of given sizes.
We have found that large thermally induced bubbles arise
amined. Understanding the mechanisms regulating the
DNA openings are of great importance for predicting and
engineering DNA processes, and we therefore now con-
sider a series of simple (but experimentally realizable)
DNA sequences where the effects discussed above are re-
produced in detail. Our purpose is to isolate the under-
lying mechanisms. Our five sequences are all composed
of 20 G/C, 5 A/T and 20 G/C base pairs. This is fol-
lowed by a sequence that alternates A/T and G/C base
pairs. We use 3, 4, 5, 6, and 7 A/T base-pairs in the five
sequences. Finally, all five sequences are terminated with
20 G/C base-pairs.
As shown in Fig. 3a, the largest 1 bp opening prob-
ability is localized at +20, a region that contains five
consecutive A/T bases, and is therefore expected to be
more susceptible to open than the region localized around
+50, containing A/T’s alternating with G/C’s. However,
this simple picture changes dramatically as we move to
larger bubbles, Figs. 3b, 3c, and 3d. In all these cases,
the height of the second peak increases as compared to
the peak at +20. With 3 and 4 A/T’s, the peaks saturate
at a value lower than the first peak. However, the height
of the two peaks for the sequence with 5 A/T’s, becomes
equal in P7, and remains so for larger bubbles. The se-
quence with 6 A/T’s shows an inversion of the opening
probability, similar to that observed in the AdMLP se-
quence: at P10 the most probable 10 base-pair opening
occurs around the base-pair location 50. These data in-
dicate that the opening probability of a bubble of a given
length does not trivially depend on the number of consec-
utive A/T’s in the DNA sub-sequence. Instead, bubbles
of sizes n form with higher probabilities in regions where
the number of A/Ts over some characteristic length L(κ)
is higher, even if the A/Ts are mixed with G/C pairs.
To confirm this hypothesis we have extracted the char-
acteristic lengths L(κ) as a function of the bubble size n
from the probability distributions of the simple sequences
considered in Fig. 3. For instance, for κ = 1, Fig. 3a , we
have obtained L(1) = 4 sites, while for κ = 5, Fig. 3b,
we have L(5) = 10 sites. We have therefore considered
the AAV P5 promoter DNA sequence of Fig. 2. Start-
ing from each site s of the sequence, we count the num-
ber Nκ(s) of A/T pairs contained over the corresponding
next L(κ) sites. In Fig. 4a we show the results for bub-
bles of size κ = 1, which can be compared with Fig. 2a
. The small difference between the mutant and the wild
opening probability for the sites around located at 0 is
well reproduced. The difference between the wild and
the mutant sequence is most pronounced for κ = 5, Fig.
4b, which is also in agreement with the TIO calculation
shown in Fig. 2c.
We have also demonstrated that the opening probabil-
ity does not depend on the specific distribution of the
AT pairs contained in the characteristic regions of length
L(κ). This is shown in Fig.5, where we have calculated
−40 −20 0 200
2
4
bp
N1
a
−40 −20 0 20
4
6
8
bp
N5
b
FIG. 4: Number of A/T bps contained in the characteristic
length L(κ), where κ = 1 (top panel), and 5 (bottom panel)
for the wild (dashed line) and mutant (solid line) P5 promoter.
0 50 100 1400
0.00001
0.00002
0.00003
0.00004
0.00005
0.00006
0.00007
bp
P10
FIG. 5: Probability P10 for the formation of a bubble of 10
consecutive bps. The sequences consist of 40 G/C, 5 A/T
and 20 G/C followed by 14 bps containing different random
combinations of 7 A/T and 7 G/C. All sequences end with 40
G/C bps.
with the TIO the probability for formation of bubbles
with size P10 for a sequence where the second A/T rich re-
gions always contains 7 A/T distributed in several differ-
ent combinations, but always over a 20-base region. We
see that independently of the distribution of A/T base-
pairs, the probability of 10 base-pair bubbles is always
largest in the right-most region. Physically, we interpret
this as the nonlinear coherence dominating (smoothing
out the effect of) the base-pair disorder.
In summary, we have developed a semi-analytical tech-
niques (TIO) that allows the efficient prediction of a given
sequence for thermally induced bubbles of given sizes.
We have found that large thermally induced bubbles arise
Page 5
5through a subtle interplay between length scales inher-
ent in the nonlinear dynamics, and the sequence disorder.
Our results provide new understandings that can help to
not only identify new protein coding genes, but also en-
able reverse-engineering for use in future gene therapeu-
tic applications.
This work at Los Alamos National Laboratory is sup-
ported by the US Department of Energy (contract No.
W-7405-ENG-36) and by a NIH grant for A.U. (Grant
Number: R01 GM073911).
[1] M. Gueron, M. Kochoyan, and J.L. Leroy Nature 328,
89 (1987); M. Frank-Kamenetskii Nature 328 17 (1987).
[2] C.H. Choi, G. Kalosakas, K.Ø. Rasmussen, M. Hiromura,
A. R. Bishop and A. Usheva, Nucleic Acids Res. 32, 1584,
(2004).
[3] T. Dauxois, M. Peyrard and A. R. Bishop, Phys. Rev. E
47 R44 (1993).
[4] M. Peyrard and A. R. Bishop, Phys. Rev. Lett, 62, 2755,
(1989).
[5] M. Peyrard, Nonlinearity 17, R1 (2004).
[6] G. Kalosakas, K.Ø. Rasmussen, A. R. Bishop, C.H. Choi,
and A. Usheva, Europhys. Lett. 68, 127, (2004).
[7] S. Ares, N.K. Voulgarakis, K.Ø. Rasmussen and A.R.
Bishop, Phys. Rev. Lett., 94, 035504, (2005).
[8] D. J. Scalapino, M. Sears and R.A. Ferrell, Phys. Rev.
B, 6, 3409, (1972).
[9] Y. Zhang, W.-M. Zheng, J.-X. Liu, and Y. Z. Chen, Phys.
Rev. E, 56, 7100, (1997).
[10] T. Dauxois and M. Peyrard, Phys. Rev. E 51 4027 (1995).
[11] A. Campa and A. Giansanti, Phys. Rev. E, 58, 3585,
(1998).
[12] The parameters were chosen in Ref. [11] to fit thermo-
dynamic properties of DNA: k = 0.025eV/A2, ρ = 2,
β = 0.35A−1 for the inter-site coupling; for the Morse
potential DGC = 0.075eV , aGC = 6.9A−1 for a G-C bp,
DAT = 0.05eV , aAT = 4.2A−1 for the A-T bp.
[13] M. B. Fogel, Nonlinear Order Parameter Fields: I. Soli-
ton Dynamics, II. Thermodynamics of a Model Impure
System, Ph.D. Thesis, Cornel University, (1977).
[14] A. Usheva and Shenk, Cell 76, 1115 (1994)
ent in the nonlinear dynamics, and the sequence disorder.
Our results provide new understandings that can help to
not only identify new protein coding genes, but also en-
able reverse-engineering for use in future gene therapeu-
tic applications.
This work at Los Alamos National Laboratory is sup-
ported by the US Department of Energy (contract No.
W-7405-ENG-36) and by a NIH grant for A.U. (Grant
Number: R01 GM073911).
[1] M. Gueron, M. Kochoyan, and J.L. Leroy Nature 328,
89 (1987); M. Frank-Kamenetskii Nature 328 17 (1987).
[2] C.H. Choi, G. Kalosakas, K.Ø. Rasmussen, M. Hiromura,
A. R. Bishop and A. Usheva, Nucleic Acids Res. 32, 1584,
(2004).
[3] T. Dauxois, M. Peyrard and A. R. Bishop, Phys. Rev. E
47 R44 (1993).
[4] M. Peyrard and A. R. Bishop, Phys. Rev. Lett, 62, 2755,
(1989).
[5] M. Peyrard, Nonlinearity 17, R1 (2004).
[6] G. Kalosakas, K.Ø. Rasmussen, A. R. Bishop, C.H. Choi,
and A. Usheva, Europhys. Lett. 68, 127, (2004).
[7] S. Ares, N.K. Voulgarakis, K.Ø. Rasmussen and A.R.
Bishop, Phys. Rev. Lett., 94, 035504, (2005).
[8] D. J. Scalapino, M. Sears and R.A. Ferrell, Phys. Rev.
B, 6, 3409, (1972).
[9] Y. Zhang, W.-M. Zheng, J.-X. Liu, and Y. Z. Chen, Phys.
Rev. E, 56, 7100, (1997).
[10] T. Dauxois and M. Peyrard, Phys. Rev. E 51 4027 (1995).
[11] A. Campa and A. Giansanti, Phys. Rev. E, 58, 3585,
(1998).
[12] The parameters were chosen in Ref. [11] to fit thermo-
dynamic properties of DNA: k = 0.025eV/A2, ρ = 2,
β = 0.35A−1 for the inter-site coupling; for the Morse
potential DGC = 0.075eV , aGC = 6.9A−1 for a G-C bp,
DAT = 0.05eV , aAT = 4.2A−1 for the A-T bp.
[13] M. B. Fogel, Nonlinear Order Parameter Fields: I. Soli-
ton Dynamics, II. Thermodynamics of a Model Impure
System, Ph.D. Thesis, Cornel University, (1977).
[14] A. Usheva and Shenk, Cell 76, 1115 (1994)
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime
Start using Mendeley in seconds!
Readership Statistics
5 Readers on Mendeley
by Discipline
80% Physics
by Academic Status
60% Post Doc
20% Ph.D. Student
20% Researcher (at a non-Academic Institution)
by Country
40% United States
40% Germany
20% Ireland


