
Bacterial protein structures reveal phylum dependent divergence.

by Matthew D Shortridge, Thomas Triplet, Peter Revesz, Mark A Griep, Robert Powers
Computational Biology and Chemistry (2011)

Abstract

Protein sequence space is vast compared to protein fold space. This raises important questions about how structures adapt to evolutionary changes in protein sequences. A growing trend is to regard protein fold space as a continuum rather than a series of discrete structures. From this perspective, homologous protein structures within the same functional classification should reveal a constant rate of structural drift relative to sequence changes. The clusters of orthologous groups (COG) classification system was used to annotate homologous bacterial protein structures in the Protein Data Bank (PDB). The structures and sequences of proteins within each COG were compared against each other to establish their relatedness. As expected, the analysis demonstrates a sharp structural divergence between the bacterial phyla Firmicutes and Proteobacteria. Additionally, each COG had a distinct sequence/structure relationship, indicating that different evolutionary pressures affect the degree of structural divergence. However, our analysis also shows the relative drift rate between sequence identity and structure divergence remains constant.


Computational Biology and Chemistry 35 (2011) 24–33
Journal homepage: www.elsevier.com/locate/compbiolchem

Research Article

Matthew D. Shortridge (a), Thomas Triplet (b,1), Peter Revesz (b), Mark A. Griep (a), Robert Powers (a,*)
a Department of Chemistry, University of Nebraska-Lincoln, 722 Hamilton Hall, Lincoln, NE 68588-0304, United States
b Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115, United States
* Corresponding author. Tel.: +1 402 472 3039; fax: +1 402 472 9402. E-mail address: rpowers3@unl.edu (R. Powers).
1 Present address: Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada H4B-1R6.

Article history: Received 27 July 2010; received in revised form 28 December 2010; accepted 29 December 2010.

Keywords: Proteins; Structure; Sequence; Function; Evolution

Abbreviations: FSS, Fractional Structure Similarity (structure similarity ratio); COG, Cluster of Orthologous Groups; PDB, Protein Data Bank; Split, clusters showing strong phylogenetic split pattern based on structure; Split + 1, clusters showing strong phylogenetic split pattern with one outlier based on structure; Starburst, clusters with variable phylogenetic patterns based on structure; $Z_{AA}$ and $Z_{BB}$, Dali Z-scores for self comparisons; $Z_{AB}$, Dali Z-scores for pairwise comparisons; SeqID, sequence similarity ratio.

1476-9271/$ – see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2010.12.004

1. Introduction

Quantifiable models of protein evolution are useful for developing robust tools to identify suitable drug-binding sites, to predict increases in susceptibility to a human genetic disease, and to predict and modify organismal niches. Some of the strongest arguments in favor of biological evolution draw from studies on protein evolution using sequence homology (Do and Katoh, 2008). Multiple sequence alignments are routinely used to create phylogenetic relationships (Chang et al., 2008; Feng, 2007), which highlights sequence variability between organisms. The accepted view of protein evolution is that changes to the protein’s gene sequence are selected and modulated by a number of factors that include structure (Pal et al., 2006; Rocha, 2006).

What is the impact on protein structure as its sequence undergoes genetic drift? Maintaining the correct protein fold is fundamental to preserving its function (Forouhar et al., 2007), but evolving the sequence would also be expected to result in structural changes (Chothia and Lesk, 1986; Rost, 1999). The resulting observation is that sequence determines a protein’s structure, but the structure is relatively invariant over a large range of sequences. This is highlighted by the tremendous difference between the number of known protein structures versus protein folds (Sadreyev and Grishin, 2006). Even though the Protein Data Bank (PDB) (Berman et al., 2000) contains 66,083 protein structures as of June 22, 2010, there are only 1233 unique topologies and 1195 unique folds in the CATH (Orengo et al., 1997) and SCOP (Murzin et al., 1995) structure classification databases, respectively. The significant reduction in the number of protein folds relative to the number of protein sequences implies a much stronger correlation between structure and function. Correspondingly, protein structures are generally viewed as more conserved than their sequences, and recent studies have attempted to quantify this statement (Illergard et al., 2009).

The explicit reason for the reduction in fold space remains unclear. However, some have suggested that protein fold space may be more appropriately described as a continuum instead of a collection of discrete folds (Kolodny et al., 2006). In this manner, a protein fold should be considered as being plastic, where sequence changes are accommodated by local perturbations in the structure while maintaining the general characteristics of a particular fold (Illergard et al., 2009; Panchenko et al., 2005; Williams and Lovell, 2009). Correspondingly, the genetic drift in a protein’s sequence may imply a similar gradual divergence in structure instead of a sudden dramatic transition to a new fold. From this perspective, a comparative analysis of homologous proteins should identify correlated rates of structure and sequence divergence. Previous studies have looked at
homologous structure similarity before but the datasets did not try to show structure divergence consequences on phylogenetic relationships (Illergard et al., 2009; Panchenko et al., 2005; Williams and Lovell, 2009). To help understand how protein plasticity affects organism divergence, we compared 48 sets of homologous protein families annotated in the COG database for two bacterial phyla, Proteobacteria and Firmicutes.

2. Materials and methods

2.1. COG assignment of the Protein Data Bank

Assignment of each bacterial protein in the PDB to a COG number in the clusters of orthologous groups (Tatusov et al., 2003) database required downloading the complete sequence lists from both databases and running a pairwise Basic Local Alignment Search Tool (BLAST) comparison. The pairwise protein BLAST search was run using the Protein Mapping and Comparison Tool (PROMPT v0.9.2) (Schmidt and Frishman, 2006) that allowed for large pairwise BLAST searching and reported the best match between the two databases. The BLAST search was run using the BLOSUM62 matrix with a gap penalty of 11, gap extension penalty of 1, a word size of 5, and a BLAST expectation threshold (E-value) of 10^−9. This E-value was used to unambiguously match genes in the COG database with proteins in the PDB. All PDB-to-COG matches were reported and stored in our PROFESS (PROtein Function, Evolution, Structure, and Sequence) database (http://cse.unl.edu/~profess/, Triplet et al., 2010).

After matching structures to their representative COG, each PDB entry was matched with its source organism and phylum. The data set was then filtered according to the number of unique organisms. Specifically, only those COGs with structures from two or more different source organisms in both Proteobacteria and Firmicutes were analyzed further.

2.2. Pairwise structure comparison

The pairwise structure comparison program DaliLite v2.4.2 (Holm and Park, 2000) was installed on our 16-node Beowulf cluster (Dual Athlon AMD 2.13 GHz with 1 GB of RAM) running CentOS 4.4 Linux with a 2.25 TB RAID array. A C-shell script matches the PDB files from each Proteobacteria–Proteobacteria comparison (−/−), Firmicutes–Firmicutes comparison (+/+) and Proteobacteria–Firmicutes comparison (−/+) and then submits the job to the program DaliLite. Each structural comparison takes approximately 2–10 min, depending on the size and relative similarity of structures. The total time to run all 63,504 comparisons was approximately 7 weeks.

The shell script extracts all structural comparison information reported by DaliLite (comparison files, rmsd, %Sequence ID, Z-score) on a per chain basis. A single PDB file may contain multiple protein chains, where each chain may have a separate COG assignment. All structure information is stored in our PROFESS database (Triplet et al., 2010), which is parsed to find the largest Z-score for each pairwise structure comparison. The largest Z-score represents the best structure comparison for a pair of proteins and ensures that the correct PDB chains were used for the analysis and the correct COG assignments were made. All best matches from each COG were used to calculate the Fractional Structure Similarity score (FSS) described by Eq. (1).

\[ \mathrm{FSS} = \frac{Z_{AB}}{\max(Z_{AA},\, Z_{BB})} \tag{1} \]

where $Z_{AB}$ was the Z-score for comparing proteins A and B, $Z_{AA}$ was the Z-score when protein A was compared to itself and $Z_{BB}$ was the Z-score when protein B was compared to itself. Thus, $Z_{AA}$ and $Z_{BB}$ represent the Z-score that can be achieved for perfect similarity.

2.3. Manual filtering and data analysis

Manual refinement of the dataset included verification of each PDB assignment to a COG and filtering out redundantly solved structures from the same organism. When multiple structures were reported from the same organism (or organism with synonymous name), the structure that gave the largest Dali Z-score within the COG was kept while remaining structures were discarded from the analysis. This confirmed a single best PDB-COG match for each organism. Manual refinement was accomplished by opening all PDB IDs within a COG and checking biological information against the PDB (http://www.rcsb.org/pdb/home), COG (http://www.ncbi.nlm.nih.gov/COG/) and the NCBI (http://www.ncbi.nlm.nih.gov/) web servers. Consistency in functional and structural assignment within a COG coupled with low E-values between COG and PDB confirmed the best matches were functionally the same protein. Additionally, manual refinement was used to verify uniform sample conditions (i.e., the same ligand bound to all proteins within a COG or all proteins correspond to wild-type sequences) for cases of redundantly solved structures.

The PDB to CATH linkage was obtained directly from the CATH v3.2 database. The CATH classification for each structural domain for the PDB files listed in Table 1 was manually verified using the CATH search engine. This was important because, even though 32 of the 48 COG structure families are single-domain proteins, the remaining 16 COG structure families have two or three domains.

2.4. Structure based phylogenetic trees

In addition to pairwise alignment, all the protein structures from each COG were simultaneously aligned using the multiple structure alignment program MAMMOTH-multi (http://ub.cbm.uam.es/mammoth/mult/) (Lupyan et al., 2005). The resulting aligned structures and the structure-based sequence alignment were used with in-house software to calculate an all-versus-all matrix of per-residue Cα distances. Standard bootstrapping techniques were then applied to the all-versus-all matrix of per-residue Cα distances to generate 100 distance-matrix tables. Columns of structure-based sequence alignments with the corresponding Cα distances were randomly selected until the total number of columns in the original sequence alignment was reached. The resulting set of Cα distances was then used to calculate a root mean square deviation (rmsd) between each pair of structures in the matrix. The 100 distance-matrix tables were imported into PHYLIP v3.68 (Felsenstein, 1989) to generate a consensus phylogenetic tree with bootstrap confidence levels (Efron et al., 1996).

Each set of 100 bootstrapped distance matrices was analyzed by the Fitch–Margoliash method implemented in PHYLIP. Each matrix was jumbled with 100 replicates using a random number generator seed. This resulted in 10,000 unique and random distance matrices for each COG. The best tree was identified with the program Consense implemented in PHYLIP using the extended majority rule conservation. Since the bootstrapped trees do not show distance relationship, the original distance matrix generated by MAMMOTH-multi was used to generate a distance based phylogenetic tree. Each original distance matrix was jumbled with 100 replicates using a random number seed. The distance based phylogenetic tree was drawn using the program Drawtree implemented in PHYLIP. Each tree was visually inspected and compared with the DaliLite analysis using the bootstrap values to determine if a tree fit the starburst, split or split + 1 classification.
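
The Fractional Structure Similarity score of Eq. (1) in Section 2.2 can be illustrated with a short Python sketch. It first keeps the largest Z-score found for each pair of proteins, as the text describes for the chain-level DaliLite results, and then divides by the larger of the two self-comparison Z-scores. The function names, the tuple layout of `chain_hits`, and the protein labels are hypothetical and chosen only for illustration; this is not the authors' shell script or PROFESS query.

```python
def fss(z_ab, z_aa, z_bb):
    """Eq. (1): FSS = Z_AB / max(Z_AA, Z_BB)."""
    denom = max(z_aa, z_bb)
    if denom <= 0:
        raise ValueError("self-comparison Z-scores must be positive")
    return z_ab / denom


def best_fss_per_pair(chain_hits, self_z):
    """Compute FSS from the largest Z-score seen for each protein pair.

    chain_hits: iterable of (protein_a, protein_b, z_score) tuples, one per
                chain-level comparison (hypothetical layout).
    self_z:     dict mapping protein -> Z-score of the protein against itself.
    """
    best = {}
    for a, b, z in chain_hits:
        pair = tuple(sorted((a, b)))  # treat (A, B) and (B, A) as one pair
        if z > best.get(pair, float("-inf")):
            best[pair] = z
    return {
        pair: fss(z, self_z[pair[0]], self_z[pair[1]])
        for pair, z in best.items()
    }


if __name__ == "__main__":
    hits = [("protA", "protB", 20.0), ("protB", "protA", 30.0)]
    print(best_fss_per_pair(hits, {"protA": 40.0, "protB": 60.0}))
```

Dividing by the larger self-score means a short protein compared with a long one cannot reach an FSS of 1, which matches the text's description of the self-comparison Z-scores as the value achievable for perfect similarity.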
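
The column bootstrap of Section 2.4 can be sketched as follows. The in-house software is not described in detail, so this is only a minimal illustration under stated assumptions: the alignment has no gaps, `per_column[i][j]` is a hypothetical list holding one Cα distance per alignment column for structures i and j, and alignment columns are drawn with replacement until the original column count is reached. Each replicate yields one rmsd distance matrix of the kind that would then be passed to PHYLIP.

```python
import math
import random


def rmsd(distances):
    """Root mean square of a list of per-column distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def bootstrap_rmsd_matrices(per_column, n_replicates=100, seed=0):
    """Build bootstrapped rmsd distance matrices by resampling columns.

    per_column[i][j] is a list of per-column C-alpha distances between
    structures i and j (assumed gap-free and of equal length for all pairs).
    """
    n = len(per_column)
    if n < 2:
        raise ValueError("need at least two structures")
    n_cols = len(per_column[0][1])
    rng = random.Random(seed)
    matrices = []
    for _ in range(n_replicates):
        # Draw column indices with replacement until the original
        # number of alignment columns is reached.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        matrix = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                value = rmsd([per_column[i][j][c] for c in cols])
                matrix[i][j] = matrix[j][i] = value
        matrices.append(matrix)
    return matrices
```

The same column indices are reused for every pair within a replicate, so each bootstrapped matrix corresponds to one resampled alignment rather than to independently resampled pairs.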
Table 1
COG structure families.^a
COG | Function | S_go_sim(COG)^b | Phylogenetic structure tree^c | CATH | Which domain?
28 | Thiamine pyrophosphate requiring enzymes | 0.59 | Split | 3.40.50.970; 3.40.50.1220; 3.40.50.970 | 1st; 2nd; 3rd
39 | Malate/lactate dehydrogenases | 0.80 | Split | 3.40.50.720 | Single domain
394 | Protein-tyrosine-phosphatase | 0.61 | Split | 3.40.50.270 | Single domain
446 | Uncharacterized NAD (FAD)-dependent dehydrogenases | 0.85 | Split | 3.50.50.60; 3.50.50.60; 3.30.390.30 | 1st; 2nd; 3rd
604 | NADPH:quinone reductase and related Zn-dependent oxidoreductases | 0.88 | Split | 3.40.50.720 | Single domain
605 | Superoxide dismutase | 0.76 | Split | d | 1st and 2nd
742 | N6-adenine-specific methylase | 0.73 | Split | d | Single domain
813 | Purine-nucleoside phosphorylase | 0.87 | Split | 3.40.50.1580 | Single domain
1012 | NAD-dependent aldehyde dehydrogenases | 0.58 | Split | 3.40.309.10 | 1st and 2nd
1057 | Nicotinic acid mononucleotide adenylyltransferase | 0.95 | Split | 3.40.50.620 | Single domain
1075 | Predicted acetyltransferases and hydrolases with the alpha/beta hydrolase fold | 0.70 | Split | 3.40.50.1820 | Single domain
1607 | Acyl-CoA hydrolase | 0.87 | Split | d | Single domain
1940 | Transcriptional regulator/sugar kinase | 0.31 | Split | 3.30.420.40; 3.30.420.160 | 1st; 2nd
2124 | Cytochrome P450 | 0.80 | Split | 1.10.630.10 | Single domain
2188 | Transcriptional regulators | 0.89 | Split | 3.40.1410.10 | Single domain
242 | N-formylmethionyl-tRNA deformylase | 0.87 | Split with HGT | 3.90.45.10 | Single domain
1052 | Lactate dehydrogenase and related dehydrogenases | 0.89 | Split with HGT | 3.40.50.720 | 1st and 2nd
2141 | Coenzyme F420-dependent N5,N10-methylene tetrahydromethanopterin reductase and related flavin-dependent oxidoreductases | 0.76 | Split with HGT | 3.20.20.30 | Single domain
3832 | Uncharacterized conserved protein | 1.00 | Split with HGT | 3.30.530.20 | Single domain
110 | Acetyltransferase (isoleucine patch superfamily) | 0.56 | Starburst | 2.160.10.10 | Single domain
171 | NAD synthase | 0.85 | Starburst | 3.40.50.620 | Single domain
251 | Putative translation initiation inhibitor, yjgF family | 0.00 | Starburst | 3.30.1330.40 | Single domain
346 | Lactoylglutathione lyase and related lyases | 0.11 | Starburst | 3.10.180.10 | Single domain
366 | Glycosidases | 0.51 | Starburst | 3.20.20.80; 3.90.400.10; 2.60.40.1180 | 1st; 2nd; 3rd
454 | Histone acetyltransferase HPA2 and related acetyltransferases | 0.83 | Starburst | 3.40.630.30 | Single domain
491 | Zn-dependent hydrolases, including glyoxylases | 0.50 | Starburst | 3.60.15.10 | Single domain
500 | SAM-dependent methyltransferases | 0.59 | Starburst | 3.40.1630.10; 3.40.50.150 | 1st; 2nd
526 | Thiol-disulfide isomerase and thioredoxins | 0.96 | Starburst | 3.40.30.10 | Single domain
590 | Cytosine/adenosine deaminases | 0.70 | Starburst | 3.40.140.10 | Single domain
637 | Predicted phosphatase/phosphohexomutase | 0.52 | Starburst | 3.40.50.1000; 1.10.150.240 or 1.10.164.10 | 1st; 2nd
664 | cAMP-binding proteins | 0.50 | Starburst | 2.60.120.10; 1.10.10.10 | 1st; 2nd
745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.73 | Starburst | 3.40.50.2300 | Single domain
753 | Catalase | 0.93 | Starburst | d | Single domain
778 | Nitroreductase | 0.64 | Starburst | 3.40.109.10 | Single domain
784 | FOG: CheY-like receiver | 0.48 | Starburst | 3.40.50.2300 | Single domain
796 | Glutamate racemase | 0.92 | Starburst | 3.40.50.1860 | 1st and 2nd
1028 | Dehydrogenases with different specificities (related to short-chain alcohol dehydrogenases) | 0.84 | Starburst | 3.40.50.720 | Single domain
1151 | 6Fe–6S prismane cluster-containing protein | 0.71 | Starburst | 3.40.50.2030; 3.40.50.2030; 1.20.1270.30 | 1st; 2nd; 3rd
1309 | Transcriptional regulator | 0.80 | Starburst | 1.10.10.60; 1.10.357.10 | 1st; 2nd
1396 | Predicted transcriptional regulators | 0.54 | Starburst | 1.10.260.40; 2.60.120.10 | 1st; 2nd
1404 | Subtilisin-like serine proteases | 0.60 | Starburst | 3.40.50.200 | Single domain
1733 | Predicted transcriptional regulators | 1.00 | Starburst | d | Single domain
1846 | Transcriptional regulators | 0.85 | Starburst | 1.10.10.10 | Single domain
2159 | Predicted metal-dependent hydrolase of the TIM-barrel fold | 0.83 | Starburst | 3.20.20.140 | Single domain
2367 | Beta-lactamase class A | 0.93 | Starburst | 3.40.710.10 | Single domain
2730 | Endoglucanase | 0.88 | Starburst | 3.20.20.80 | Single domain
3693 | Beta-1,4-xylanase | 0.89 | Starburst | 3.20.20.80 | Single domain
4948 | L-alanine-DL-glutamate epimerase and related enzymes of enolase superfamily | 0.71 | Starburst | 3.30.390.10; 3.20.20.120 | 1st; 2nd
a COG structure families have two or more represented structures from among the Firmicutes and two or more from among the Proteobacteria.
b Normalized GO functional similarity score between each protein’s GO term set and the consensus GO term set for the COG (Eq. (2)).
c “Split” means the Firmicutes and Proteobacteria proteins were strongly separated from one another, “Starburst” means there was little to no evidence for a split according to phyla, and “Split with HGT” means there was strong evidence for a split according to phyla with the exception of one protein, which may indicate horizontal gene transfer (described as split + 1 in the text). See Supplementary Table IS for a list of the PDB files associated with each COG.
d The protein structures in this COG family are in the CATH holding pen awaiting manual domain separation and/or final CATH assignment.
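
The score in footnote b, the normalized GO functional similarity of Eq. (2), can be sketched in Python as below. Several choices are assumptions made for illustration rather than details confirmed by the source: "majority" is taken to mean more than half of the proteins, the sum of the per-protein ratios is divided by the number of proteins (following the "average, normalized" wording of Section 2.5), and a protein with no GO terms compared against an empty consensus contributes 0. The function names and GO term labels are hypothetical.

```python
from collections import Counter


def weak_consensus(go_sets):
    """GO terms shared by a majority (assumed: more than half) of proteins."""
    counts = Counter(term for terms in go_sets for term in set(terms))
    return {term for term, n in counts.items() if n > len(go_sets) / 2}


def go_similarity(go_sets):
    """Average ratio |GO(p) & consensus| / |GO(p) | consensus| over proteins p."""
    if not go_sets:
        raise ValueError("a COG needs at least one protein")
    consensus = weak_consensus(go_sets)
    total = 0.0
    for terms in go_sets:
        union = set(terms) | consensus
        if union:  # an empty union contributes 0 (assumption)
            total += len(set(terms) & consensus) / len(union)
    return total / len(go_sets)


if __name__ == "__main__":
    cog = [{"term_a", "term_b"}, {"term_a"}, {"term_a", "term_c"}]
    print(round(go_similarity(cog), 3))
```

Under these assumptions a COG whose proteins all lack GO terms scores 0, which is consistent with the 0.00 reported for COG 251 in Table 1.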
2.5. Measuring functional similarity within a COG

Each protein in our dataset was annotated with the corresponding Gene Ontology (Ashburner et al., 2000) identification number found in the PDB. By definition, a strong consensus requires each protein to share the same GO term. Instead, a weak consensus set of GO terms was generated for each COG, where only a majority of proteins are required to share the same GO term. A distance was measured between the weak consensus set and the set of GO terms assigned to each individual protein. An average, normalized distance is reported for each COG, where a score of 1 indicates an identical functional classification and a score of 0 indicates a lack of functional similarity. The normalized GO functional similarity score between each protein’s GO term set and the consensus GO term set for the COG was measured as follows:

\[ S_{\mathrm{go\_sim}}(\mathrm{COG}) = \sum_{p \in \mathrm{COG}} \frac{\left| GO_{\mathrm{cog}}(p) \cap GO_{\mathrm{cog\_wc}} \right|}{\left| GO_{\mathrm{cog}}(p) \cup GO_{\mathrm{cog\_wc}} \right|} \tag{2} \]

where $S_{\mathrm{go\_sim}}(\mathrm{COG})$ is the normalized GO functional similarity score, $GO_{\mathrm{cog\_wc}}$ denotes the weak consensus set of GO terms for the COG, and $GO_{\mathrm{cog}}(p)$ denotes the set of GO terms for each protein p in the COG.

3. Results

3.1. Creating the COG structure families

Current functional annotation tools available in the PDB include the Gene Ontology (GO) (Ashburner et al., 2000) and Enzyme Classification (EC) (Schomburg et al., 2004). Unfortunately, due to the potential for convergence of function, these annotation tools are not useful for the study of homologous structures. To accurately observe phylum dependent structure divergence of proteins, it is important to construct a dataset of functionally similar orthologs. Among the 20 resources for structural classification of proteins, the clusters of orthologous groups (COGs) scheme is the only one that attempts to identify orthology (Ouzounis et al., 2003) while providing moderate functional information. Therefore, each sequence and structure in the PDB was annotated with one COG number. Additionally, each protein was annotated with GO numbers and the relative functional similarity for each COG was measured (Table 1). This was achieved by developing the PROFESS database (Triplet et al., 2010) that contains the PDB to COG annotations among other biologically relevant information. This includes associating each structure with its phyla classification, which allowed for the structures from Firmicutes and Proteobacteria to be easily selected for further analysis (Table S1).

The most recent COG database was created by finding the genome-specific best-hit for each gene in 66 unicellular genomes (50 bacteria, 13 archaea, and 3 eukaryota). Specifically, the orthologs present in three or more genomes were detected automatically and then multidomain proteins were manually split into component domains to eliminate artifactual lumping. The online COG database contains 192,987 sequences distributed among 4876 COGs, accounting for 75% of genes in these 66 genomes.

At the time of our COG-to-PDB annotation, the PDB included 45,368 protein structures, although many of them were composed of multiple subunits (and therefore associated with an even larger number of sequences). The two best-represented bacterial phyla, which account for nearly one-fourth of all structures in the PDB, were selected for annotation. The PDB contains 8298 Proteobacteria protein structures and 3416 Firmicutes structures. The sequences for each of these structures were compared to the COG reference sequences using the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990). An expectation cut-off of 1 × 10^−9 was used to maximize the likelihood of matching each PDB with its correct COG. The BLAST comparison matched 82% of the Firmicutes and Proteobacteria sequences to specific COGs, resulting in the clustering of 2728 Firmicutes structures and 6881 Proteobacteria structures. Of these hits, 27% were 100% identical to the COG reference sequence and 97% matched with greater than 50% sequence identity. To carry out our comparative study, we selected only those COGs that contained a minimum of two Firmicutes organisms and two Proteobacteria organisms. This requirement gave 281 unique COGs with a total of 3047 bacterial proteins (1066 Firmicutes and 1981 Proteobacteria). In addition to COG clustering, the eggNOG (http://eggnog.embl.de) (Muller et al., 2010) and OMA databases (Schneider et al., 2007) (http://omabrowser.org) were also mined for generating orthologous sets of protein structures. However, there was no set of proteins that met our criteria of two structures per phylum per cluster.

To further support the COG–PDB clusters, the overall functional similarity for each COG was determined by measuring the average distance between the Gene Ontology annotations for each protein and a weak consensus list of GO annotations (Table 1). Overall, each COG exhibited high functional similarity (0.72 ± 0.21), with 1 being functionally identical and 0 being functionally dissimilar. In addition to the high sequence and structure similarity within each COG, the GO functional similarity measure provides further support that the proteins have been properly annotated to the correct COG. Nevertheless, there are three apparent outliers: COG0251 (putative translation initiation inhibitor, yjgF family), COG0346 (lactoylglutathione lyase and related lyases) and COG1940 (transcriptional regulator/sugar kinase) have GO similarity scores of 0, 0.11 and 0.31, respectively. The low GO similarity scores for these COGs are driven by the inclusion of unannotated proteins in the dataset. All six single-domain proteins associated with COG0251 are classified as conserved hypothetical proteins and have no associated GO terms. Of the seventeen single-domain proteins associated with COG0346, nine lack GO term assignments and have no functional annotation. Additionally, for COG1940, two of the five two-domain proteins have no GO terms assigned to the structure.

3.2. Pairwise structure similarity

The pairwise structure comparison tool DaliLite (Holm and Park, 2000) was used to perform 63,504 pairwise comparisons between all of the proteins in our dataset. In total, the backbone structure similarity corresponded to 31,542 Proteobacteria–Proteobacteria comparisons (−/−), 12,674 Firmicutes–Firmicutes comparisons (+/+), and 19,288 Proteobacteria–Firmicutes comparisons (−/+). All comparisons were manually filtered within their respective COG to remove all but one redundantly solved structure (the largest contributor to the size reduction of the dataset), multiple or non-functionally relevant conformations (mutant protein, non-native experimental conditions, inhibited ligand complex), and the shorter of two protein structures. The final dataset contained 48 COGs (Table 1) with a total of 1713 structural comparisons among 147 Firmicutes proteins from 58 unique organisms and 176 Proteobacteria proteins from 84 unique organisms.

Fig. 1. The relationship between structure similarity and sequence identity for 48 COGs. Structure similarity is given as the raw Z-score, which increases as the protein length increases. The comparisons were for all proteins against all proteins, and include those for each protein against itself. The dashed line identifies a Dali Z-score of 2, which is the minimal limit for inferring structural similarity.

The resulting Dali Z-scores from the pairwise structure comparisons were plotted against sequence identity (Fig. 1) to reveal a saturating relationship as the percent identity rose to 100%. The lowest observed Z-score was 5.7 with a corresponding 16% sequence identity. This Z-score was well above the minimum cutoff of 2.0 (dashed line) for matches that were two standard deviations above a random match. This lowest Z-score came from the comparison of two Firmicutes proteins in COG0346 (lactoylglutathione lyase and related lyases): 2QH0 (Clostridium acetobutylicum) and 2QQZ (Bacillus anthracis). The average Z-score for all comparisons
between these single-domain proteins was 27 ± 13, indicating that all structural matches were very significant even at sequence identities below 20%. All structure comparisons corresponding to 100% sequence identity in Fig. 1 result from a protein structure compared against itself. The inherent range in Z-scores at 100% sequence identity highlights the need to develop a normalized structure comparison score.

Since Z-scores increase as a function of the protein length, we normalized for this effect by calculating Fractional Structure Similarity (FSS) scores (see Eq. (1)). When the pairwise FSS scores were plotted against sequence identity (Fig. 2), a hyperbolic curve was obtained with all FSS values below an upper limit at each percent identity. In fact, 20% sequence identity yielded a maximal FSS of 60%. This FSS limit was observed when all of the data were used (Fig. 2A), when only the pairwise comparisons within either phylum were used (Fig. 2B and C), or when only the pairwise comparisons between the two phyla were used (Fig. 2D). The pairwise comparison plot between the two phyla (Fig. 2D) showed an abrupt cutoff at 61% sequence identity and a 0.84 FSS score. This was not an artifact created by culling the dataset, since a similar plot prior to the manual filtering also demonstrated the same effect (Supplemental Fig. 1).

Fig. 2. The Fractional Structure Similarity (FSS) and sequence identity for 48 COGs. FSS was calculated using Eq. (1) to normalize the Dali Z-scores for their different sizes. The FSS values were plotted against sequence identity for (A) all the pairwise comparisons, (B) only Proteobacteria–Proteobacteria comparisons, (C) only Firmicutes–Firmicutes comparisons and (D) only Proteobacteria–Firmicutes comparisons.

The protein structures in COG0028 (thiamine pyrophosphate requiring enzymes) provide a useful example of the structural divergence that occurred after the Firmicutes and Proteobacteria phyla split. The overall fold is conserved between the phyla while there are discrete structural elements that are unique to each phylum. The two Firmicutes structures (Fig. 3A and B) yield a Z-score of 59.6 and an FSS of 0.83, indicating very high structural conservation. There are more representative Proteobacteria structures that yield an average Z-score of 37.7 ± 1.6 and an average FSS of 0.58 ± 0.03. Again, the structures share a similar fold despite the slightly lower scores. Comparison of structures between the Firmicutes and Proteobacteria (Fig. 3C and D, respectively) phyla yields a lower Z-score of 34.8 ± 1.2 and a lower FSS of 0.49 ± 0.02 than the comparisons within each phylum. This suggests a divergence in structural details while conserving the overall fold. A detailed analysis reveals localized differences between the structures from the two phyla (see red highlights in Fig. 3C and D). In the Firmicutes representative structure, there is a continuous helix compared to helical breaks and loop insertions in the Proteobacteria structure. This is similar to the C-terminal domain of primase, where a long continuous helix found in the Escherichia coli structure is broken by a loop region in B. stearothermophilus (Bailey et al., 2007; Oakley et al., 2005; Su et al., 2006; Syson et al., 2005).

3.3. COG structure phylogenies

Structure based phylogenies were created from root-mean square differences (rmsd) in per residue Cα positions for optimally aligned protein structures using MAMMOTH-multi (Lupyan et al., 2005). A separate phylogenetic tree was generated for each COG, where three distinct patterns were observed (Table 1): 15 trees exhibited a strong split at the phylum level, 29 exhibited a starburst pattern suggesting little to no evidence for a split according to phyla, and 4 exhibited a strong split at the phylum level but with the exception of a single structure (split + 1).

As shown in Table 1, the pattern of the structure based tree is not dependent on the relative GO functional similarity score for the proteins within each COG. All three tree patterns have a range of GO functional similarity scores with an average score of 0.75 ± 0.16, 0.88 ± 0.09, and 0.70 ± 0.24 for the split, split + 1, and starburst tree patterns, respectively. Overall, the GO similarity scores within each COG are high, indicating conserved and consistent functional annotations for each COG.

The 15 COG phylogenies with strong phylum-splitting patterns had two branches, one with closely related Firmicutes structures and the other with closely related Proteobacteria structures. Two examples are COG0028 (Thiamine pyrophosphate requiring enzymes) and COG0446 (Uncharacterized NAD/FAD-dependent dehydrogenases) (Fig. 4). The structures for both of these multidomain COGs are classified in the CATH system as α/β 3-layer sandwiches, but differ in that COG0028 proteins have a Rossmann fold topology (Fig. 3) and COG0446 proteins have a FAD/NAD(P)-binding domain topology.

The 29 COGs with phylogenetic starburst patterns showed no evidence for the separation of structures according to phyla (Table 1). Two examples were COG0491 (Zn-dependent hydrolases) and COG1309 (transcriptional regulator) (Fig. 4). The CATH classification for COG0491 Bacillus cereus Zinc-dependent β-lactamase (PDB ID: 1BC2) (Fabiane et al., 1998) describes the protein as an α/β 4-layer sandwich with metallo-β-lactamase Chain A topology. The large category of β-lactamases constitutes a collection of enzymes that can be derived from any one of a group of proteins that bind, synthesize, or degrade peptidoglycans. The protein
structures assigned to COG0491 gave FSS scores with large standard deviations, as is evident
from the separated clusters within the Proteobacteria arm of the phylogenetic tree.

Fig. 3. Comparison of protein structures for COG0028 between two bacterial phyla. The protein
structures for COG0028 thiamine pyrophosphate requiring enzymes show (A) that the two
Firmicutes structures have highly overlapping structures and (B) that the four Proteobacteria
structures are very similar to one another (see also the phylogenetic structure tree for
COG0028 in Fig. 4). On the other hand, the major structural differences between the Firmicutes
and Proteobacteria are highlighted in red on a representative Firmicutes (C) structure from
L. plantarum (Lpl) (PDB ID: 1POW) (Muller et al., 1994) and the representative Proteobacteria
structure (D) from P. fluorescens (Pfl) (PDB ID: 2AG0) (Mosbacher et al., 2005). (For
interpretation of the references to color in text, the reader is referred to the web version
of the article.)

The two-domain COG1309 structural family falls into one of two CATH topologies represented by
Arc Repressor Mutant subunit A and Tetracycline Repressor domain 2. Only those structures
similar to the Arc Repressor Mutant (subunit A) topology were used for the pairwise
comparison, since it was the dominant fold in this COG. The protein structures in the COG1309
structure family gave low FSS scores. However, even with a low overall FSS, the average
absolute Z-score was 13 ± 2, indicating that it has significant overall structure similarity.
The high FSS deviations of the COG0491 structural family and the low average FSS scores of the
COG1309 structural family both indicate rapid structural divergence following the phyla split,
consistent with the observed starburst phylogenetic patterns.

Four COG structure phylogenies showed a strong split pattern with a single outlier (Fig. 5).
This result provides further evidence for the observation of phyla split based on protein
structure similarity. The presence of the outlier in a clear split pattern suggests either a
horizontally transferred gene (Table 1) or a potential paralog. For all four families [COG0242
(N-formylmethionyl-tRNA deformylase), COG1052 (lactate dehydrogenase and related
dehydrogenases), COG2141 (coenzyme F420-dependent N5,N10-methylene tetrahydromethanopterin
reductase and related flavin-dependent oxidoreductases), and COG3832 (uncharacterized conserved
protein)] there was a large Dali Z-score and reliable BLAST E-values, implying a correct match
was made between COG and PDB. Additionally, all four COGs exhibited high GO functional
similarity scores, suggesting a consistent functional assignment (Table 1). For COG0242, the
Bacillus cereus gene def that encodes the N-formylmethionyl-tRNA deformylase protein (PDB ID:
1WS0) has been previously identified as a gene that has undergone horizontal gene transfer
(Garcia-Vallve et al., 2000).

3.4. Structure divergence rates across phyla

As a way to quantify the relationship between structure and sequence differences, each
phylogenetic tree was reduced to a single coordinate by calculating a structure similarity
ratio (FSS) and a sequence identity ratio (SeqID). FSS was determined for all 48 COGs by
calculating an average FSS score for the Proteobacteria–Firmicutes structure comparisons,
Avg(FSS+/−), and dividing by the sum of the average Proteobacteria–Proteobacteria,
Avg(FSS−/−), and Firmicutes–Firmicutes, Avg(FSS+/+), comparisons:

\[
\mathrm{FSS} = \frac{\mathrm{Avg}(\mathrm{FSS}^{+/-})}{\mathrm{Avg}(\mathrm{FSS}^{+/+})/2 + \mathrm{Avg}(\mathrm{FSS}^{-/-})/2} \qquad (3)
\]

Similarly, a sequence identity ratio (SeqID) was determined by calculating an average sequence
identity for the Proteobacteria–Firmicutes structure comparisons, Avg(SeqID+/−), and dividing
by the sum of the average Proteobacteria–Proteobacteria, Avg(SeqID−/−), and
Firmicutes–Firmicutes, Avg(SeqID+/+), comparisons:

\[
\mathrm{SeqID} = \frac{\mathrm{Avg}(\mathrm{SeqID}^{+/-})}{\mathrm{Avg}(\mathrm{SeqID}^{+/+})/2 + \mathrm{Avg}(\mathrm{SeqID}^{-/-})/2} \qquad (4)
\]

In general, most starburst phylogenies (see representative COG0491 and COG1309 in Fig. 4) had
a branch length between
members of different phyla that was much shorter than the branch lengths between members
within the same phylum. That is, a starburst phylogeny was expected to have FSS and SeqID
values greater than unity. Likewise, most split phylogenies had longer branches between phyla
than within each phylum (see representative COG0028 and COG0446 in Fig. 4) and were expected
to yield FSS and SeqID of less than unity.

Fig. 4. Protein structure based phylogenetic trees highlighting the split and starburst
patterns. The phylogenetic structure trees showed three different patterns: (top) strong split
according to phyla; (bottom) starburst with no clear relationship to a common ancestor; and
(Fig. 5) strong splits with the exception of one outlier. The Firmicutes protein structures
are in blue and the Proteobacteria in black. The bootstrap values from 100 bootstrap
replicates are indicated on branches and represent how often a branch appeared in the distance
matrix. The two examples for the split pattern were from COG0028 (thiamine pyrophosphate
requiring enzymes) and COG0446 (uncharacterized NAD(FAD)-dependent dehydrogenases). In the
case of a strong split, the central branches were observed more than 95 times out of 100
replicate trials. The two examples for starburst pattern were from COG0491 (Zn-dependent
hydrolases) and COG1309 (transcriptional regulator). For starburst patterns, very few branches
were observed in more than two-thirds of the 100 replicate trials. The organism abbreviations
are: A. hydrophila (Ahy); A. tumefaciens (Atu); A. viridans (Avi); B. cereus (Bce);
B. japonicum (Bja); B. subtilis (Bsu); B. thuringiensis (Bth); E. carotovora (Eca); E. coli
(Eco); E. faecalis (Efa); F. gormanii (Fgo); K. pneumoniae (Kpn); L. lactis (Lla);
L. sanfranciscensis (Lsa); L. plantarum (Lpl); O. formigenes (Ofo); P. aeruginosa (Pae);
P. fluorescens (Pfl); P. pantotrophus (Ppa); P. putida (Ppu); P. species (Psp); S. aureus
(Sau); S. marcescens (Sma); S. typhimurium (Sty); and X. maltophilia (Xma).

When FSS and SeqID for all 48 COGs were plotted versus one another (Fig. 6), 79% of the
starburst phylogenies were equal to or greater than unity for both structure and sequence
whereas 84% of the split phylogenies were below a FSS of 0.9 for structure and 73% of split
phylogenies were below a SeqID of 0.80 for sequence. This indicated that split phylogenies
occur when the structure differences are less than their sequence differences. In addition,
the plot of FSS versus SeqID conformed to a linear relationship regardless of the shape of the
phylogenetic tree, indicating that all homologous protein structure differences are constant
with respect to homologous protein sequence differences (FSS = 0.55SeqID + 0.45; R² = 0.7).
Thus, this curve represents the relative structural drift rate for each COG structural family
between the two phyla. The slope indicates that structure branch
Fig. 5. Protein structure based phylogenetic trees highlighting the split + 1 pat-
tern. Protein structure phylogenies of 4 COGs out of 48 had a strong split
pattern with the exception of one outlier structure. The phylogenies were very
reliable because the central branches were observed in 100 out of 100 repli-
cate trials. When one Firmicutes or Proteobacteria protein structure clusters on
a branch with the other phylum, its structure diverges from its closest relatives
while resembling those of the other phyla. The COGs that fit this pattern are
from COG0242 (N-formylmethionyl-tRNA deformylase), COG1052 (lactate dehy-
drogenase and related dehydrogenases), COG2141 (coenzyme F420-dependent N5,
N10-methylene tetrahydromethanopterin reductase and related flavin-dependent
oxidoreductases), and COG3832 (uncharacterized conserved protein). The organism
abbreviations are: A. fermentans (Afe); A. tumefaciens (Atu); B. cereus (Bce); B. halo-
durans (Bha); B. stearothermophilus (Bst); B. subtilis (Bsu); C. violaceum (Cvi); E. coli
(Eco); E. faecalis (Efa); H. methylovorum (Hme); H. pylori (Hpy); L. delbrueckii (Lde);
L. helveticus (Lhe); M. species (Msp); N. europaea (Neu); P. aeruginosa (Pae); P. species
(Psp), S. aureus (Sau); S. pneumoniae (Spn); and V. harveyi (Vha).
Fig. 6. Constant rate of structural drift. The relationship between structure and
sequence change was constant regardless of the phylogenetic starburst (×) or split
() pattern. Structure changes measured using a structure similarity ratio (FSS),
where the average FSS between members of the two phyla (Firmicutes versus Pro-
teobacteria) was divided by the average FSS between members of the same phyla
(see Eq. (3)). Sequence change was calculated similarly (see Eq. (4)). The best-fit line,
FSS = 0.55SeqID + 0.45, yielded an R2 of 0.70.
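
The ratios of Eqs. (3) and (4) and the best-fit line of Fig. 6 can be illustrated with a minimal Python sketch. The function names and the pairwise scores below are hypothetical and chosen only for illustration; they are not values from the dataset, and the authors' own implementation is not described in the text. Only the form of the ratio and the reported slope (0.55) and intercept (0.45) come from the article.

```python
from statistics import mean


def phylum_ratio(cross, within_first, within_second):
    """Ratio in the form of Eq. (3)/(4): the cross-phylum average divided by
    the sum of the two halved within-phylum averages."""
    denominator = mean(within_first) / 2 + mean(within_second) / 2
    return mean(cross) / denominator


def fitted_fss_ratio(seqid_ratio, slope=0.55, intercept=0.45):
    """FSS ratio given by the best-fit line reported for Fig. 6."""
    return slope * seqid_ratio + intercept


# Hypothetical pairwise FSS scores for a single COG (illustrative only).
fss_cross = [0.50, 0.48]         # Proteobacteria-Firmicutes pairs
fss_proteo = [0.80, 0.82, 0.78]  # Proteobacteria-Proteobacteria pairs
fss_firmi = [0.83]               # Firmicutes-Firmicutes pairs

fss_ratio = phylum_ratio(fss_cross, fss_proteo, fss_firmi)
print(f"FSS ratio: {fss_ratio:.2f}")
print(f"Fitted FSS ratio at a SeqID ratio of 0.5: {fitted_fss_ratio(0.5):.3f}")
```

With these made-up scores the ratio falls below unity, which is the range the text associates with split phylogenies. The same function applies unchanged to sequence identities for Eq. (4).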
lengths change approximately half as fast as sequence branch lengths.

3.5. Fold dependency on structure similarity

A plot of FSS versus sequence identity for the two most populated CATH families in our dataset
(Fig. 7) was used to investigate if particular protein architectures are more amenable to
structural changes. Thirty-one of 66 total domains (47%), the largest portion of our data set,
are classified as CATH 3.40 (α/β, 3-layer (αβα) sandwich). The CATH 3.40 classification is
more often associated with the split phylogenetic tree pattern (12 out of 22 total domains or
55%) than the starburst pattern (17 out of 39 total domains or 44%).

The second most populous CATH family is CATH 1.10 (mainly α, orthogonal bundle) with 11% of
our COGs belonging to this CATH family. Most (85.7%) of the COGs (6 of 7) in the CATH 1.10
family are represented by the starburst phylogenetic tree pattern with only one COG
represented by a split pattern. There appears to be a limit in structure similarity at
approximately 0.6 FSS and a corresponding sequence identity limit at 40% for CATH 1.10 (Fig. 7,
solid circles). This limit is not observed in the CATH 3.40 family (Fig. 7, open diamonds).
The sequence and structure similarity limit for CATH 1.10 combined with a larger percentage of
COGs assigned to the starburst family suggests that CATH 1.10 is more susceptible to mutations
that affect the protein structure.

Fig. 7. Fold dependency on Fractional Structure Similarity (FSS) and sequence comparisons. The
FSS between two CATH families, CATH 1.10 (●) and CATH 3.40 (♦). CATH 1.10 (mainly α,
orthogonal bundle) family is apparently limited to approximately 40% sequence identity and 0.6
FSS while CATH 3.40 (α/β, 3-layer (αβα) sandwich) fills in the complete curve. 87.5% of the
COGs (7 of 8) represented by CATH 1.10 give a starburst structure similarity tree.
Contrastingly, only 50% (12 of 24) of the COGs represented by CATH 3.40 give a starburst
structure similarity tree. The remaining 12 COGs formed either split (11 of 12) or split + 1
(1 of 12).

4. Discussion

There is an inherent challenge in obtaining an accurate functional annotation for a large set
of proteins from a relatively small number of experimentally determined functions (Andrade,
2003; Frishman, 2007; Karp et al., 2001; Rentzsch and Orengo, 2009; Valencia, 2005). The
available functional information is incomplete, ambiguous and error-prone (Benitez-Paez, 2009;
Schnoes et al., 2009) and requires multiple sources (Rentzsch and Orengo, 2009) to improve the
accuracy in the annotation of a protein. There is also the complicating factor of correctly
distinguishing between orthologs and paralogs, where it has been previously noted that the COG
database does include some paralog members (Dessimoz et al., 2006; Tatusov et al., 2003). Thus,
the accuracy of our analysis of structural divergence is fundamentally dependent on a reliable
functional assignment for each protein structure. Given these challenges, the independent and
separate utilization of both COG and GO terms provides a reasonable and robust approach to
identify clusters of functionally similar proteins. The overall high sequence (E-value ≤ 10⁻⁹,
sequence identity ≥ 16%), structure (Z-score > 5.7) and GO term similarity (0.72 ± 0.21)
within each COG supports this conclusion. The lack of identity for the GO term similarity
scores should not be interpreted as evidence for functional divergence. GO terms are assigned
based on a validated source. So, a missing GO term for a protein is more likely attributed to
the fact that the protein has not been explicitly tested for the specified activity.
Similarly, a protein being assigned a GO term does not provide definitive evidence that the
function is relevant in vivo (Canevascini et al., 1996; Lindorff-Larsen et al., 2001; Otsuka
et al., 2002; West et al., 2004).

The comparison of homologous protein structures with the same function provides quantitative
evidence that protein structures diverged following the speciation events that created the
modern bacterial phyla of Firmicutes and Proteobacteria. The abrupt cutoff at 61% sequence
identity and 0.84 Fractional Structure Similarity observed between Firmicutes and
Proteobacteria proteins was mirrored by an approximate 60% protein sequence identity between
these two phyla observed by 16S rRNA sequence similarity (Konstantinidis and Tiedje, 2005a,b).
Thus, this maximum observed sequence identity imparts limits to the maximum possible structure
similarity between homologous proteins from these two phyla. This is consistent with prior
observations that sequence identity ≤ 40–50% sometimes results in significant structural and
functional differences (Chothia and Lesk, 1986; Rost, 1999; Rost, 2002). Furthermore, the
results imply an inherent allowable structural plasticity that does not perturb function.
Additionally, the random drift after speciation inexorably leads to non-identical structures
despite maintenance of function. There are a number of cases where FSS was below 0.20,
indicating a significant structural change. Proteins with completely different folds but the
same function are extreme examples of the plasticity of the structure–function relationship
and include such proteins as peptidyl-tRNA hydrolases (COG1990) (Powers et al., 2005),
pantothenate kinase (KOG2201) (Yang et al., 2006), polypeptide release factors (Kisselev, 2002)
and lysyl-tRNA synthetases (COG1190) (Ibba et al., 1997); these proteins are not in our
dataset.

Forty percent of the COGs we examined have evolved slowly enough that it was possible to
generate phylogenetic trees consistent with this ancient split. The other COGs have either
evolved too rapidly or are otherwise subject to few evolutionary constraints to provide
evidence for this split. This distinction between the COGs is clearly apparent from the
comparison of FSS and SeqID in Fig. 6. The slope in Fig. 6 indicates a fixed relative structure
drift rate, where structure changes half as fast as sequence across phyla. This correlation in
the divergence of protein sequences and protein structures has additional ramifications beyond
bacterial evolution. Our analysis implies a continuum of protein folds that adapt to large
sequence changes by incurring local structural modifications (Illergard et al., 2009; Kolodny
et al., 2006; Panchenko et al., 2005; Williams and Lovell, 2009). This continuum of protein
folds makes it challenging to apply protein structural classification to identify function, as
has been previously noted (Hadley and Jones, 1999; Pascual-Garcia et al., 2009).

Does the nature of the protein’s three-dimensional structure play a role in protein structure
divergence? Our analysis demonstrates that some proteins evolve slowly and maintain high
sequence identity (>80%) and structure similarity (>0.80 FSS) while other proteins exhibit
rapid evolution rates where sequence identity is ≤20% and FSS ≤ 0.40. This implies that the
underlying architecture of a particular protein may be more or less amenable to amino-acid
substitutions in order to maintain functional activity. A specific protein fold may have a
higher intrinsic plasticity that enables it to readily accommodate sequence changes through
local
conformational changes without a detrimental impact on activity.
This is exactly what was observed.
Structural variations were localized to specific regions as illus-
trated by the comparison of the COG0028 protein structures (see
Fig. 3). This is consistent with the observation that there are different structure divergence
rates within a protein (Chirpich, 1975; Lin et al., 2007). Regions of the protein that do not
impact biological activity are expected to yield a higher divergence rate and incur larger
local structural changes (Chothia and Lesk, 1986; Lesk and Chothia, 1980). As a result, a fold
with a relatively high plasticity would experience an elevated structural diversity between
phyla, where the rate of change may closely parallel the mutation rate (Illergard et al.,
2009). Conversely, another fold may be extremely sensitive to amino-acid substitutions, where
minor sequence perturbations may result in a decrease in structural integrity and a
corresponding loss of activity. This analysis is consistent with the known range of protein
thermodynamic stabilities (Robertson and Murphy, 1997), and the general observation that most
mutations destabilize protein structures (Sanchez et al., 2006).

For instance, CATH 1.10 was the second most abundant protein architecture observed in our
study, comprising 11% of the total domains. It was very strongly associated with the fastest
evolving protein structure and corresponds to an orthogonal α-helical bundle. Conversely, the
highly populated CATH 3.40 is a 3-layer (αβα) sandwich with a slower evolution rate compared
to CATH 1.10 structures. β-Sheets are strongly influenced by long-range interactions and, on
average, have a higher hydrophobic environment compared to α-helices (Gromiha and Ponnuswamy,
1995). Effectively, the protein environment is an important factor in β-sheet folding
(Parisien and Major, 2007). Since the stability and structure of a protein is strongly
dependent on the integrity of the hydrophobic core (Vlassi et al., 1999), which is formed by
the β-sheet in the 3-layer (αβα) sandwich, mutations in the β-sheet are probably less
tolerated.

Our study illustrates the inherent value in solving structures for functionally identical
proteins from multiple organisms. A major challenge in creating our COG-to-PDB dataset was the
fundamental requirement to have structures from at least two Firmicutes organisms and two
Proteobacteria organisms. Only 48 (∼1%) of the 4876 COGs meet this stringent requirement. The
limited number of multiple homologous structures has partly occurred because structural
biology efforts are focused on obtaining single representative structures for each functional
class or protein fold (Chandonia and Brenner, 2005) and understandably biased toward
therapeutically relevant proteins (Mestres, 2005). If we are to achieve a more accurate
understanding of the relationship between the evolution of protein fold, protein sequence, and
the organisms in which they function, the fields of bioinformatics and structural biology must
expand their focus to include efforts to obtain a more diverse set of homologous protein
structures.

Disclaimer

The content of this article is solely the responsibility of the authors and does not
necessarily represent the official views of the National Institute of Allergy and Infectious
Diseases.

Acknowledgements

We would like to thank Venkat Ram Santosh from the University of Nebraska-Lincoln for his
contribution to the GO functional similarity scores. This work was supported in part from the
National Institute of Allergy and Infectious Diseases (Grant No. R21AI081154), from the
Nebraska Tobacco Settlement Biomedical Research Development Funds, and a Nebraska Research
Council Interdisciplinary Research Grant to R.P. The research was performed in facilities
renovated with support from NIH (RR015468-01).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at
doi:10.1016/j.compbiolchem.2010.12.004.

References

Altschul, S.F., et al., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Andrade, M.A., 2003. Automatic Genome Annotation and the Status of Sequence
Databases. Horizon Scientific Press, 107–121.
Ashburner, M., et al., 2000. Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nat. Genet. 25, 25–29.
Bailey, S., et al., 2007. Structure of hexameric DnaB helicase and its complex with a
domain of DnaG primase. Science 318, 459–463.
Benitez-Paez, A., 2009. Considerations to improve functional annotations in biological
databases. OMICS 13, 527–532.
Berman, H.M., et al., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Canevascini, S., et al., 1996. Tissue-specific expression and promoter analysis of the
tobacco ltp1 gene. Plant Physiol. 112, 513–524.
Chandonia, J.M., Brenner, S.E., 2005. Implications of structural genomics target selection
strategies: Pfam5000, whole genome, and random approaches. Proteins 58,
166–179.
Chang, G.S., et al., 2008. Phylogenetic profiles reveal evolutionary relationships
within the “twilight zone” of sequence similarity. Proc. Natl. Acad. Sci. U.S.A.
105, 13474–13479.
Chirpich, T.P., 1975. Rates of protein evolution. Function of amino acid composition.
Science 188, 1022–1023.
Chothia, C., Lesk, A.M., 1986. The relation between the divergence of sequence and
structure in proteins. EMBO J. 5, 823–826.
Dessimoz, C., et al., 2006. Detecting non-orthology in the COGs database and other
approaches grouping orthologs using genome-specific best hits. Nucleic Acids
Res. 34, 3309–3316.
Do, C.B., Katoh, K., 2008. Protein multiple sequence alignment. Methods Mol. Biol.
(Totowa, NJ, USA) 484, 379–413.
Efron, B., et al., 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl.
Acad. Sci. U.S.A. 93, 7085–7090.
Fabiane, S.M., et al., 1998. Crystal structure of the zinc-dependent beta-lactamase
from Bacillus cereus at 1.9 Å resolution: binuclear active site with features of a
mononuclear enzyme. Biochemistry 37, 12404–12411.
Felsenstein, J., 1989. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics
5, 164–166.
Feng, J.-A., 2007. Improving pairwise sequence alignment between distantly related
proteins. Methods Mol. Biol. (Totowa, NJ, USA) 395, 255–268.
Forouhar, F., et al., 2007. Functional insights from structural genomics. J. Struct.
Funct. Genomics 8, 37–44.
Frishman, D., 2007. Protein annotation at genomic scale: the current status. Chem.
Rev. 107, 3448–3466.
Garcia-Vallve, S., et al., 2000. Horizontal gene transfer in bacterial and archaeal
complete genomes. Genome Res. 10, 1719–1725.
Gromiha, M.M., Ponnuswamy, P.K., 1995. Prediction of protein secondary structures
from their hydrophobic characteristics. Int. J. Pept. Protein Res. 45, 225–240.
Hadley, C., Jones, D.T., 1999. A systematic comparison of protein structure classifications:
SCOP, CATH and FSSP. Structure 7, 1099–1112.
Holm, L., Park, J., 2000. DaliLite workbench for protein structure comparison.
Bioinformatics 16, 566–567.
Ibba, M., et al., 1997. A euryarchaeal lysyl-tRNA synthetase: resemblance to class I
synthetases. Science 278, 1119–1122.
Illergard, K., et al., 2009. Structure is three to ten times more conserved than
sequence – a study of structural response in protein cores. Proteins 77, 499–508.
Karp, P.D., et al., 2001. Database verification studies of SWISS-PROT and GenBank.
Bioinformatics 17, 526–532.
Kisselev, L., 2002. Polypeptide release factors in prokaryotes and eukaryotes: same
function, different structure. Structure 10, 8–9.
Kolodny, R., et al., 2006. Protein structure comparison: implications for the nature
of ‘fold space’, and structure and function prediction. Curr. Opin. Struct. Biol. 16,
393–398.
Konstantinidis, K.T., Tiedje, J.M., 2005a. Genomic insights that advance the species
definition for prokaryotes. Proc. Natl. Acad. Sci. U.S.A. 102, 2567–2572.
Konstantinidis, K.T., Tiedje, J.M., 2005b. Towards a genome-based taxonomy for
prokaryotes. J. Bacteriol. 187, 6258–6264.
Lesk, A.M., Chothia, C., 1980. How different amino acid sequences determine similar
protein structures: the structure and evolutionary dynamics of the globins. J.
Mol. Biol. 136, 225–270.
Lin, Y.-S., et al., 2007. Proportion of solvent-exposed amino acids in a protein and
rate of protein evolution. Mol. Biol. Evol. 24, 1005–1011.
Lindorff-Larsen, K., et al., 2001. Barley lipid transfer protein, LTP1, contains a
new type of lipid-like post-translational modification. J. Biol. Chem. 276,
33547–33553.
Lupyan, D., et al., 2005. A new progressive-iterative algorithm for multiple structure
alignment. Bioinformatics 21, 3255–3263.
Mestres, J., 2005. Representativity of target families in the Protein Data Bank: impact
for family-directed structure-based drug discovery. Drug Discovery Today 10,
1629–1637.
Mosbacher, T.G., et al., 2005. Structure and mechanism of the ThDP-dependent ben-
zaldehyde lyase from Pseudomonas fluorescens. FEBS J. 272, 6067–6076.
Muller, Y.A., et al., 1994. The refined structures of a stabilized mutant and of
wild-type pyruvate oxidase from Lactobacillus plantarum. J. Mol. Biol. 237,
315–335.
Muller, J., et al., 2010. eggNOG v2.0: extending the evolutionary genealogy of
genes with enhanced non-supervised orthologous groups, species and func-
tional annotations. Nucleic Acids Res. 38, D190–D195.
Murzin, A.G., et al., 1995. SCOP: a structural classification of proteins database for
the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Oakley, A.J., et al., 2005. Crystal and solution structures of the helicase-binding
domain of Escherichia coli primase. J. Biol. Chem. 280, 11495–11504.
Orengo, C.A., et al., 1997. CATH – a hierarchic classification of protein domain struc-
tures. Structure 5, 1093–1108.
Otsuka, T., et al., 2002. CCl4-induced acute liver injury in mice is inhibited by hep-
atocyte growth factor overexpression but stimulated by NK2 overexpression.
FEBS Lett. 532, 391–395.
Ouzounis, C.A., et al., 2003. Classification schemes for protein structure and function.
Nat. Rev. Genet. 4, 508–519.
Pal, C., et al., 2006. An integrated view of protein evolution. Nat. Rev. Genet. 7,
337–348.
Panchenko, A.R., et al., 2005. Evolutionary plasticity of protein families: coupling
between sequence and structure variation. Proteins 61, 535–544.
Parisien, M., Major, F., 2007. Ranking the factors that contribute to protein beta-sheet
folding. Proteins 68, 824–829.
Pascual-Garcia, A., 2009. Cross-over between discrete and continuous protein struc-
ture space: insights into automatic classification and networks of protein
structures. PLoS Comput. Biol. 5.
Powers, R., et al., 2005. Solution structure of Archaeglobus fulgidis peptidyl-tRNA
hydrolase (Pth2) provides evidence for an extensive conserved family of Pth2
enzymes in archea, bacteria, and eukaryotes. Protein Sci. 14, 2849–2861.
Rentzsch, R., Orengo, C.A., 2009. Protein function prediction – the power of multi-
plicity. Trends Biotechnol. 27, 210–219.
Robertson, A.D., Murphy, K.P., 1997. Protein structure and the energetics of protein
stability. Chem. Rev. 97, 1251–1268.
Rocha, E.P., 2006. The quest for the universals of protein evolution. Trends Genet.
22, 412–416.
Rost, B., 1999. Twilight zone of protein sequence alignments. Protein Eng. 12,
85–94.
Rost, B., 2002. Enzyme function less conserved than anticipated. J. Mol. Biol. 318,
595–608.
Sadreyev, R.I., Grishin, N.V., 2006. Exploring dynamics of protein structure determi-
nation and homology-based prediction to estimate the number of superfamilies
and folds. BMC Struct. Biol. 6, 6.
Sanchez, I.E., et al., 2006. Point mutations in protein globular domains: contributions
from function, stability and misfolding. J. Mol. Biol. 363, 422–432.
Schmidt, T., Frishman, D., 2006. PROMPT: a protein mapping and comparison tool.
BMC Bioinform. 7, 331.
Schneider, A., et al., 2007. OMA browser-exploring orthologous relations across 352
complete genomes. Bioinformatics 23, 2180–2182.
Schnoes, A.M., et al., 2009. Annotation error in public databases: misannotation of
molecular function in enzyme superfamilies. PLoS Comput. Biol. 5, e1000605.
Schomburg, I., et al., 2004. BRENDA, the enzyme database: updates and major new
developments. Nucleic Acids Res. 32, D431–433.
Su, X.C., et al., 2006. Monomeric solution structure of the helicase-binding domain
of Escherichia coli DnaG primase. FEBS J. 273, 4997–5009.
Syson, K., et al., 2005. Solution structure of the helicase-interaction domain of the
primase DnaG: a model for helicase activation. Structure 13, 609–616.
Tatusov, R.L., et al., 2003. The COG database: an updated version includes eukaryotes.
BMC Bioinform. 4, 41.
Triplet, T., et al., 2010. PROFESS: a PROtein Function, Evolution, Structure and
Sequence database, Database 2010, baq011.
Valencia, A., 2005. Automatic annotation of protein function. Curr. Opin. Struct. Biol.
15, 267–274.
Vlassi, M., et al., 1999. A correlation between the loss of hydrophobic core packing
interactions and protein stability. J. Mol. Biol. 285, 817–827.
West, G., et al., 2004. Crystallization and X-ray analysis of bovine glycolipid transfer
protein. Acta Crystallogr., Sect. D: Biol. Crystallogr. D60, 703–705.
Williams, S.G., Lovell, S.C., 2009. The effect of sequence evolution on protein struc-
tural divergence. Mol. Biol. Evol. 26, 1055–1065.
Yang, K., et al., 2006. Crystal structure of a type III pantothenate kinase: insight into
the mechanism of an essential coenzyme A biosynthetic enzyme universally
distributed in bacteria. J. Bacteriol. 188, 5532–5540.
