Inferring protein-DNA dependencies using motif alignments and mutual information.

by Shaun Mahony, Philip E Auron, Panayiotis V Benos
Bioinformatics (2007)

Abstract

MOTIVATION: Mutual information can be used to explore covarying positions in biological sequences. In the past, it has been successfully used to infer RNA secondary structure conformations from multiple sequence alignments. In this study, we show that the same principles allow the discovery of transcription factor amino acids that are coevolving with nucleotides in their DNA-binding targets. RESULTS: Given an alignment of transcription factor binding domains, and a separate alignment of their DNA target motifs, we demonstrate that mutually covarying base-amino acid positions may indicate possible protein-DNA contacts. Examples explored in this study include C2H2 zinc finger, homeodomain and bHLH DNA-binding motif families, where a number of known base-amino acid contacting positions are identified. Mutual information analyses may aid the prediction of base-amino acid contacting pairs for particular transcription factor families, thereby yielding structural insights from sequence information alone. Such inference of protein-DNA contacting positions may guide future experimental studies of DNA recognition.


BIOINFORMATICS, Vol. 23 ISMB/ECCB 2007, pages i297–i304, doi:10.1093/bioinformatics/btm215
Inferring protein–DNA dependencies using motif alignments
and mutual information
Shaun Mahony1,*, Philip E. Auron2,3 and Panayiotis V. Benos1,4,5,*
1Department of Computational Biology, 2Department of Molecular Genetics and Biochemistry, School of Medicine,
University of Pittsburgh, 3Department of Biological Sciences, Duquesne University, 4Department of Human Genetics,
Graduate School of Public Health and 5University of Pittsburgh Cancer Institute, School of Medicine, University of
Pittsburgh, Pittsburgh, USA
Contact: shaun.mahony@ccbb.pitt.edu or benos@pitt.edu
1 INTRODUCTION
Transcription factor (TF) proteins recognize their DNA targets
via the formation of a network of specific and non-specific
molecular interactions. TF DNA-binding preferences are
usually modeled using frequency matrices derived from
alignments of known sites. Typically, these position-specific
scoring matrices (PSSMs) assume independence between the
base positions (Stormo, 2000). Structurally related TFs often
share similarities in their DNA-binding motifs. Generalized
binding models or familial binding profiles (FBPs) constitute a
measure of the ‘average’ binding specificity for a family of TFs
(Sandelin and Wasserman, 2004). Structural information and
protein sequence comparisons have been previously used to
cluster TF binding profiles in order to build FBPs (Sandelin
and Wasserman, 2004), and automatic methods have been
recently introduced (Mahony et al., 2007). FBPs allow DNA
pattern discovery algorithms to be biased towards a particular
TF structural class (Mahony et al., 2005). In addition, FBPs
can be used to infer the identity of the TF family for predicted
novel motifs (Mahony et al., 2007; Sandelin and Wasserman,
2004), or to remove degeneracy between related motifs in the
motif repositories (Cartharius et al., 2005).
In this study, we use FBP construction methods to define
alignments of related DNA-binding motifs. Given an alignment
of DNA-binding motifs from a family of related TFs, and
a separate alignment of their corresponding DNA-binding
domain sequences, we demonstrate that mutual information can
be calculated for each pair of positions between the alignments.
Positions of high covariance are shown to correspond to TF
residues that have a critical effect on DNA recognition. We
demonstrate the effectiveness of this method using C2H2 zinc
finger, homeodomain and basic helix-loop-helix (bHLH)
binding domain DNA motifs, where known protein–DNA
contacting positions are recovered using sequence information
alone. The prediction of nucleotide–amino acid contacting
potential from sequence data alone is valuable for directing
mutagenesis experiments that elucidate mechanisms of TF–DNA
recognition. As demonstrated in this article, mutual
information analyses can contribute to such
predictions.
2 METHODS
2.1 Comparing PSSM columns
A PSSM model of length L comprises 4L weights arranged in L columns.
Each column, X, follows a probability distribution,
$\{p_X(b)\}_{b \in \{A,C,G,T\}}$, with the base probability values reflecting the binding
preference of the TF to the corresponding base in this position.
The probability values can be estimated from the observed
base counts, $\{n_X(b)\}_{b \in \{A,C,G,T\}}$. We denote the estimated values
$f(X) = \{f_X(b)\}_{b \in \{A,C,G,T\}}$. In practice, $p_X$ are estimated from $n_X$ plus
some pseudocounts to reduce small sample biases and to avoid zero
probabilities. The assumption of independence between positions
is not entirely accurate, but acts as a useful approximation
(Benos et al., 2002a).
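As an illustration of the estimation step, the following minimal sketch converts the base counts of one column into frequencies. The function name `column_frequencies` and the uniform pseudocount of 1 per base are assumptions made for this example; the text does not state which pseudocount scheme was used.

```python
BASES = "ACGT"

def column_frequencies(counts, pseudocount=1.0):
    """Estimate f_X(b) from observed counts n_X(b) for one PSSM column.

    A uniform pseudocount per base is an assumption of this sketch;
    it keeps every estimated probability strictly positive.
    """
    total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
    return {b: (counts[b] + pseudocount) / total for b in BASES}

# Example column observed in 10 binding sites.
freqs = column_frequencies({"A": 8, "C": 0, "G": 2, "T": 0})
```

With these counts the unobserved bases C and T receive a small non-zero probability instead of zero, which is the purpose of the pseudocounts.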
The Pearson Correlation Coefficient (PCC) has been previously
used by us and others to compare DNA motif columns (Benos et al.,
2002a; Hughes et al., 2000; Mahony et al., 2005), and gives a measure
of agreement between two (unweighted) sets of observations by
means of their covariance. PCC is defined by:
$$\mathrm{PCC}(X,Y) = \frac{\sum_{b=A}^{T} \left(f_X(b) - \bar{f}_X\right)\left(f_Y(b) - \bar{f}_Y\right)}{\sqrt{\sum_{b=A}^{T} \left(f_X(b) - \bar{f}_X\right)^2 \cdot \sum_{b=A}^{T} \left(f_Y(b) - \bar{f}_Y\right)^2}} \qquad (1)$$
*To whom correspondence should be addressed.
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
We have recently found the PCC metric to have superior DNA motif
alignment performance over alternatives (Mahony et al., 2007).
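The column comparison can be sketched in a few lines. This is a minimal example, assuming each column is passed as a list of four frequencies in the order A, C, G, T. The function name `pcc` and the choice to return 0 when a column has no variance (for example a perfectly uniform column) are assumptions of this sketch, not details taken from the text.

```python
import math

def pcc(fx, fy):
    """Pearson correlation between two PSSM columns (frequencies for A, C, G, T)."""
    mean_x = sum(fx) / len(fx)
    mean_y = sum(fy) / len(fy)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(fx, fy))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in fx) * sum((y - mean_y) ** 2 for y in fy)
    )
    if denominator == 0.0:
        # Assumption: a column with no variance is scored as uncorrelated.
        return 0.0
    return numerator / denominator

same = pcc([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1])
different = pcc([0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7])
```

Identical columns score 1, while columns that prefer different bases score below zero.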
2.2 Comparing motifs of different lengths: P-values
A dataset of 10 000 simulated PSSMs reflecting the properties of
the PSSM models in the JASPAR database was constructed
as described at http://forkhead2.cgb.ki.se/jaspar/additional/index.htm.
Sandelin and Wasserman’s method
(Sandelin and Wasserman, 2004) was then used for the calculation of
empirical P-values that are independent of the length of the compared
motifs. In this method, the alignment scores observed between all
possible pairings of the simulated PSSMs are grouped according to
the lengths of the paired matrices. Probability distributions specific
to pairs of matrices of any given length are thus constructed and allow
calculation of the probability that an observed similarity score is no
better than that of a pair of random PSSMs of the same lengths.
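The length-specific empirical P-value can be sketched as follows. This is an illustrative outline only: the function names `build_null` and `empirical_p` and the input format (tuples of two motif lengths and an alignment score) are hypothetical, and higher scores are assumed to indicate greater similarity.

```python
from bisect import bisect_left
from collections import defaultdict

def build_null(scored_pairs):
    """Group null alignment scores by the (shorter, longer) lengths of the paired PSSMs.

    scored_pairs: iterable of (len_a, len_b, score) from simulated PSSM pairs.
    """
    null = defaultdict(list)
    for len_a, len_b, score in scored_pairs:
        null[(min(len_a, len_b), max(len_a, len_b))].append(score)
    for scores in null.values():
        scores.sort()
    return dict(null)

def empirical_p(null, len_a, len_b, observed):
    """Fraction of null scores for this length pair that are at least as high as observed."""
    scores = null.get((min(len_a, len_b), max(len_a, len_b)), [])
    if not scores:
        raise ValueError("no simulated scores for this pair of lengths")
    n_at_least = len(scores) - bisect_left(scores, observed)
    return n_at_least / len(scores)

null = build_null([(6, 8, 1.0), (8, 6, 2.0), (6, 8, 3.0), (6, 8, 4.0)])
p = empirical_p(null, 6, 8, 3.0)
```

Because scores are only compared against simulated pairs of the same two lengths, the resulting P-values can be compared across motifs of different lengths.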
2.3 Pairwise and multiple motif alignment and
tree-building methods
An ungapped, extended Smith–Waterman local alignment strategy
(Smith and Waterman, 1981) is used in this study, where the ‘motif
cores’ of the PSSM models under comparison are aligned before
extending the local alignment. The ‘core’ is defined as the longer of
(a) the four most informative adjacent columns and (b) the ‘trimmed’
motif (starting and ending at a position with information content
at least 0.3). Optimal alignment is sought in both forward/reverse
motif directions.
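The core definition can be sketched as below. This example assumes that information content is measured in bits against a uniform background (the text gives the 0.3 threshold without units) and that each column is a list of four frequencies. The function names are hypothetical.

```python
import math

def info_content(column):
    """Information content of one column in bits, assuming a uniform background."""
    return 2.0 + sum(f * math.log2(f) for f in column if f > 0)

def motif_core(pssm, window=4, min_ic=0.3):
    """Return (start, end) half-open column indices of the motif core."""
    ic = [info_content(col) for col in pssm]
    n = len(ic)
    w = min(window, n)
    # (a) the most informative run of `window` adjacent columns
    best = max(range(n - w + 1), key=lambda s: sum(ic[s:s + w]))
    core_a = (best, best + w)
    # (b) the trimmed motif: first to last column with IC >= min_ic
    keep = [i for i, value in enumerate(ic) if value >= min_ic]
    core_b = (keep[0], keep[-1] + 1) if keep else core_a
    # The core is the longer of the two candidates.
    return max(core_a, core_b, key=lambda span: span[1] - span[0])

uniform = [0.25, 0.25, 0.25, 0.25]
specific = [0.97, 0.01, 0.01, 0.01]
core = motif_core([uniform] + [specific] * 5 + [uniform])
```

Here the uninformative flanking columns are trimmed and the five specific columns form the core.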
Iterative refinement is used as the multiple alignment strategy,
and aims to combat the common problem of local minima due
to ‘frozen’ subalignments (Barton and Sternberg, 1987). Iterative
refinement builds a rough multiple alignment by progressively
adding to the current alignment the most similar input PSSM.
Once the initial alignment is built, each PSSM is removed from the
alignment in turn and realigned to a profile of the other aligned
sequences. Iteration of the realignment continues a fixed number
of times.
The trees constructed for the homeodomain and basic region
examples are built using a UPGMA algorithm, where the distances
between motifs are derived from the similarity P-values. All pairwise
alignment, multiple alignment and tree-building algorithms employed
in this study are accessible from the STAMP web platform
(http://www.benoslab.pitt.edu/stamp).
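For readers unfamiliar with UPGMA, a minimal pure-Python sketch follows. It is not the STAMP implementation. The text does not say how distances are derived from the similarity P-values, so the sketch takes a ready-made distance table as input; the function name `upgma` and the motif labels are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names: list of leaf labels.
    dist: dict mapping frozenset({a, b}) to the distance between leaves a and b.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(size) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted average of the distances to the merged clusters.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))

distances = {
    frozenset(("m1", "m2")): 0.1,
    frozenset(("m1", "m3")): 0.8,
    frozenset(("m2", "m3")): 0.6,
}
tree = upgma(["m1", "m2", "m3"], distances)
```

The two closest motifs, m1 and m2, are joined first, and m3 is then attached to that cluster.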
2.4 Mutual information
Mutual information (i.e. covariance dependency) has long been used
as an aid to RNA secondary structure prediction, allowing
the detection of pairs of codependent columns in an alignment of
RNA sequences (Chiu and Kolodziejczak, 1991; Gutell et al., 1992).
In this study, we demonstrate that mutual information analysis of
DNA motif multiple alignments may assist in the prediction of
protein positions that affect DNA binding at particular base
positions. The mutual information, $M_{ij}$, between a DNA motif
multiple alignment column and a protein alignment column is
defined as:
$$M_{ij} = \sum_{b=A}^{T} \sum_{a=A}^{Y} f_{ib,ja} \log_2 \frac{f_{ib,ja}}{f_{ib}\, f_{ja}} \qquad (2)$$
where $f_{ib}$ is the observed frequency of base b ($b \in \{A,C,G,T\}$) in
column i of the DNA alignment, $f_{ja}$ is the frequency of amino
acid a ($a \in \{A,C,D,\ldots,Y\}$) in column j of the protein alignment and
$f_{ib,ja}$ is the joint (pairwise) frequency of this base-amino acid
position combination. A multiple alignment of related DNA-binding
motifs may be constructed using the methods described above.
Given a multiple alignment of the corresponding DNA-contacting
domain protein sequences, the mutual information between
positions in the proteins and their DNA targets may be calculated.
The protein positions that exhibit high mutual information for
one or more base positions are more likely to be involved in
the binding mechanism, either by directly contacting the corresponding
bases or indirectly, e.g. by stabilizing a ‘core’ of contacting
amino acids.
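The mutual information between one DNA column and one protein column can be sketched as follows. This example assumes the simplest possible input: one base and one amino acid per transcription factor, paired by index. How joint frequencies are accumulated from full motif columns is not specified in the text, so that simplification and the function name `mutual_information` are assumptions.

```python
import math
from collections import Counter

def mutual_information(bases, residues):
    """Mutual information in bits between paired base and amino acid observations.

    bases[k] and residues[k] are assumed to come from the same TF.
    """
    if len(bases) != len(residues) or len(bases) == 0:
        raise ValueError("need equally sized, non-empty observation lists")
    n = len(bases)
    joint = Counter(zip(bases, residues))
    base_counts = Counter(bases)
    residue_counts = Counter(residues)
    mi = 0.0
    for (b, a), count in joint.items():
        f_joint = count / n
        f_b = base_counts[b] / n
        f_a = residue_counts[a] / n
        mi += f_joint * math.log2(f_joint / (f_b * f_a))
    return mi

covarying = mutual_information("AACC", "RRKK")
invariant = mutual_information("ACGT", "RRRR")
```

In the first call the base is fully determined by the amino acid, which yields 1 bit. In the second call the amino acid never changes, so the score is 0.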
2.5 Limitations of mutual information analysis
Low mutual information values should be treated with caution.
Low scores suggest that the corresponding base and amino acid
positions show no codependence only if both positions are varying
independently. Covariance carries no information if one or both
positions are invariant, so these cases should be treated as
‘missing values’ rather than ‘no codependence’.
On the other hand, high mutual information values may indicate
covariance only if both positions have sufficient examples to
provide statistical significance. For example, we may easily imagine
the extreme scenario where four aligned protein sequences contain
different amino acids in a particular position. This position will
show ‘high’ mutual information value if the four amino acids
happen to pair with different nucleotides. In such a case, however,
the ‘co-variance’ would be entirely coincidental. We ideally want the
number of observed pairs (x) to be high, as larger numbers of
examples will allow us to distinguish between true and coincidental
covariance.
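Both caveats can be worked through with the definition of mutual information given earlier, assuming each sequence contributes exactly one base-amino acid pair with equal weight. In the four-sequence scenario, each of the four observed pairs has joint frequency 1/4 and each base and amino acid has marginal frequency 1/4, so

$$M_{ij} = \sum_{k=1}^{4} \frac{1}{4} \log_2 \frac{1/4}{(1/4)(1/4)} = \log_2 4 = 2 \text{ bits},$$

which is the largest value attainable with four bases, even though it rests on only x = 4 observations. Conversely, if the amino acid column is invariant, then $f_{ja} = 1$ and $f_{ib,ja} = f_{ib}$ for the single amino acid present, so

$$M_{ij} = \sum_{b} f_{ib} \log_2 \frac{f_{ib}}{f_{ib} \cdot 1} = 0,$$

whatever the base distribution, which is why a zero score for an invariant position says nothing about codependence.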
We can use simulations to measure the extent to which coincidental
covariance could occur. To do this, 10 000 sets of x random base/amino
acid pairs were generated, and mutual information scores were
calculated for each set. For varying x, the average proportion of the
random sets that produce a mutual information score of less than
0.5 (an arbitrary low threshold) is displayed in Figure 1. As may be seen
from the figure, 140 base/amino acid pairs are required before the
chance of randomly receiving a mutual information score greater than
0.5 falls below 1%. In the EGR zinc finger example discussed below,
x = 3099 for the coalesced set after separating each of the three zinc
fingers and their DNA target, so this set comfortably exceeds the
140-pair significance threshold.
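A simulation of this kind can be sketched as below. The text does not state how the random pairs were drawn, so this sketch assumes bases and amino acids are sampled uniformly and independently; all names and the default number of sets are hypothetical, and the proportions it produces may differ from those in Figure 1.

```python
import math
import random
from collections import Counter

BASES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutual_information(pairs):
    """Mutual information in bits for a list of (base, amino acid) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    base_counts = Counter(b for b, _ in pairs)
    residue_counts = Counter(a for _, a in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((base_counts[b] / n) * (residue_counts[a] / n)))
        for (b, a), c in joint.items()
    )

def proportion_above(x, threshold=0.5, n_sets=1000, seed=0):
    """Fraction of random sets of x pairs whose score exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sets):
        # Assumption: uniform, independent sampling of bases and amino acids.
        pairs = [(rng.choice(BASES), rng.choice(AMINO_ACIDS)) for _ in range(x)]
        if mutual_information(pairs) > threshold:
            hits += 1
    return hits / n_sets

small = proportion_above(4, n_sets=200)
large = proportion_above(200, n_sets=200)
```

With very few pairs almost every random set exceeds the threshold, whereas with a few hundred pairs almost none do, which is the qualitative behaviour described above.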
Note that the above simulations and associated significance threshold
of 140 pairs are applicable only to those cases where single amino acids
are paired with single bases. In the general case, where the target
Fig. 1. Proportion of mutual information scores below 0.5 for random
base/amino acid pairs.
