Sign up & Download
Sign in

What's in a likelihood? Simple models of protein evolution and the contribution of structurally viable reconstructions to the likelihood.

by Clemens Lakner, Mark T Holder, Nick Goldman, Gavin J P Naylor
Systematic Biology ()

Abstract

Most phylogenetic models of protein evolution assume that sites are independent and identically distributed. Interactions between sites are ignored, and the likelihood can be conveniently calculated as the product of the individual site likelihoods. The calculation considers all possible transition paths (also called substitution histories or mappings) that are consistent with the observed states at the terminals, and the probability density of any particular reconstruction depends on the substitution model. The likelihood is the integral of the probability density of each substitution history taken over all possible histories that are consistent with the observed data. We investigated the extent to which transition paths that are incompatible with a protein's three-dimensional structure contribute to the likelihood. Several empirical amino acid models were tested for sequence pairs of different degrees of divergence. When simulating substitutional histories starting from a real sequence, the structural integrity of the simulated sequences quickly disintegrated. This result indicates that simple models are clearly unable to capture the constraints on sequence evolution. However, when we sampled transition paths between real sequences from the posterior probability distribution according to these same models, we found that the sampled histories were largely consistent with the tertiary structure. This suggests that simple empirical substitution models may be adequate for interpolating changes between observed sequences during phylogenetic inference despite the fact that the models cannot predict the effects of structural constraints from first principles. This study is significant because it provides a quantitative assessment of the biological realism of substitution models from the perspective of protein structure, and it provides insight on the prospects for improving models of protein sequence evolution.

Cite this document (BETA)

Available from www.ncbi.nlm.nih.gov
Page 3
hidden

What's in a likelihood? Simple mo...

2011 LAKNER ET AL.—LIKELIHOOD MODELS AND PROTEIN STRUCTURE 163 TABLE 1. Crystal structures used for the simulations Protein PDB ID Source Length Resolution ( ˚ A) Parvalbumin A 5PAL, 1RTP Leopard Shark, Black Rat 109 1.54, 2.0 Myoglobin 1LHS Loggerhead Sea Turtle 153 2.0 Lysozyme c 1JSF Human 130 1.15 HPPK 1HKA Escherichia coli 158 1.50 their UniProtKB or GenBank accession numbers) were Amphiuma means (PRVA AMPME) and Latimeria chalum- nae (PRVA LATCH) for parvalbumin, Struthio camelus (MYG STRCA) and Rattus norvegicus (MYG RAT) for myoglobin, Bufo andrewsi (LYS BUFAN) and Camelus dromedarius (LYSC CAMDR) for lysozyme, and Aeromonas salmonicida (YP 001140682.1) and Pectobac- terium atrosepticum (CAG76218.1) for HPPK. ML branch length estimates for all trees are provided in the online Supplementary material (available at www.systematicbiology.org). Alignments and trees are also available from TreeBase (Sanderson et al. 1994, study accession: http://purl.org/phylo/treebase/ phylows/study/TB2:S11030). Reference sequences.—One of the simplifying assump- tions of the models proposed by Robinson et al. (2003) and Rodrigue et al. (2005) is that the protein structure remains constant across the tree. To assess the variation in sequence-to-structure fit that we should expect to see when we fit sequences onto a fixed structure, we evaluated sequences from several species in addition to the sequences from the organisms listed above. This allowed us to interpret the structural viability of the simulated ancestors in the context of a reference dis- tribution. It is clear that this reference is incomplete, but it provides a sense of the range of scores that is at least displayed by real sequences that fold into the given structure. For HPPK, we chose the sequences from the data set in Rodrigue et al. (2006). For lysozyme (and possibly HPPK), the reference set contains paralogs. We found no indication that these sequences would pref- erentially fold into different structures. A complete list of reference-sequence identifiers can be found in the online Supplementary Material. Likelihood Calculations and Data Augmentation The models tested here were Poisson (Bishop and Friday 1987), WAG, and LG. The Poisson model is the amino acid equivalent of the Jukes and Cantor (1969) nucleotide substitution model where all substitution rates and equilibrium frequencies are equal. On a given tree ψ the actual likelihood p(D|M, ψ) is related to the so called augmented likelihood p(D, φ|M, ψ) via (unob- served) substitution histories φ p(D|M, ψ) = Φ p(D, φ|M, ψ)dφ, (1) where Φ denotes the set of all possible transition paths and M stands for the fixed parameters of the model (Mateiu and Rannala 2006 Rodrigue et al. 2007, 2008). We use the terms substitution histories (Rodrigue et al. 2005), mappings (Nielsen 2001), and transition paths (Jensen and Pedersen 2000 Pedersen and Jensen 2001 Robinson et al. 2003) synonymously. One mapping in- cludes a sequence of events and associated times for all sites of the alignment. For a single branch, the likeli- hood is the probability of observing the root sequence s0 (given the state frequencies of the model) multiplied by the probability of all possible transition paths to the descendant sequence s1 p(s0, s1|M, ν) = p(s0|M) Φ p(s1, φ|s0, M, ν)dφ, (2) where ν denotes the length of the branch. Constrained simulations.—We followed the approach de- scribed in Nielsen (2001) to sample 500 substitution histories from the distribution p(s1, φ|s0, M, ν). The pro- cess at each site of a substitution history is simply a realization of the continuous-time Markov chain that starts in the observed state of the ancestral sequence and ends in the state of the descendant sequence. Sub- stitution histories for a branch were obtained as follows: first, events were sampled separately for all sites. Then, all site events were combined, resulting in histories that connect the terminal sequences through a series of one- step neighbors (Fig. 1). These ancestral sequences were subsequently evaluated for their compatibility with the crystal structure. In the four-taxon case, we sampled 500 histories for each of the three unrooted topologies. For all models, the ML branch length was found to be a good estimator of the mean number of substitutions TABLE 2. Sequence pairs used for the simulations Protein s0 Swissprot s1 Swissprot (%) accession number accession number sequence identity Parvalbumin A Leopard Shark P30563 Black Rat P02625 58.7 Myoglobin American Alligator P02200 Burchell’s Zebra P68083 63.4 Lysozyme c Human P84492 Green Turtle P61626 62.3 HPPK Vibrio cholerae Q9KUC9 Salmonella typhi Q8Z9C4 55.1 at University of Pennsylvania Library on September 9, 2011 sysbio.oxfordjournals.org Downloaded from
Page 4
hidden
164 SYSTEMATIC BIOLOGY VOL. 60 FIGURE 1. Example of a transition path between two hypothetical sequences s0 (AMYL) and s1 (GTYI). A mapping is generated for each site according to Nielsen (2001). Sorting all site events by their times results in a history that connects the terminals through a series of one step neighbors (a1 to a6). These ancestral sequences (along with s0 and s1) are subsequently evaluated for their compatibility with the crystal structure using a pseudo-energy (PE) score. over mappings drawn from the posterior. Sampling from the joint posterior of branch lengths and substitu- tion histories would have required us to specify a prior over branch lengths. To avoid this, we further simpli- fied our approach by fixing the branch lengths to the ML estimates (it should perhaps be noted that the ML estimate of the branch length is not the same as the pos- terior mean of the number of substitutions for a given branch length). For each sequence pair and model, we used PAML 4 (Yang 2007) to infer the ML branch lengths (Table 3). Unconstrained simulations.—In the constrained simula- tions, both endpoints of the branch were real sequences. To test the efficacy of the model to maintain structural compatibility without constraining the simulation to end in a predetermined end state (s1 in Table 2), we sampled from the distribution p(φ|s0, M, ν). For each model, we performed 500 simulations with the same expected number of substitutions as above (Table 3). To test the behavior in the limit, when the sequence is effec- tively randomized (and the composition corresponds to the equilibrium amino acid frequencies implied by the model), 100 simulations were performed for each model on an unbounded branch of length 30.0. This describes the situation where each site underwent a large number of substitutions on the branch between two terminal sequences, effectively removing all phylogenetic signal. Evaluation of Sequence-to-Structure Fit An ideal protein design procedure should generate sequences that are compatible with the native confor- TABLE 3. Branch lengths ν (expected number of substitutions per site) and likelihoods for the sequence pairs under different models Protein ν −lnL Poisson WAG LG Poisson WAG LG Parvalbumin A 0.54166 0.54033 0.58710 532.923 478.894 482.838 Myoglobin 0.46225 0.46228 0.49951 723.726 668.858 665.950 Lysozyme c 0.48017 0.51281 0.54236 619.853 594.483 601.541 HPPK 0.60856 0.66055 0.72945 791.087 741.261 735.662 mation (stability) and incompatible with other structures (specificity, for a detailed treatment of the subject, see Koehl and Levitt 1999a,b). If the sampled ancestral se- quences are to be compatible with the structure of the wild types, they must fulfill these criteria as well. Three approaches were taken to evaluate the compatibility of the sampled sequences with the crystal structures. First, we used empirically derived contact potentials to assess sequence-structure compatibility. These so-called knowledge-based potentials were used to calculate the pseudo-energy (PE), which is tightly correlated with the free energy of the sequence in the final folded form. Second, in order to test for structural specificity, we esti- mated the PE distribution for 10,000 shuffled sequences on each native structure (Table 1). Third, as an addi- tional measure for specificity, we compared the PE of each sampled sequence on the native conformation to its PE scores on a set of misfolded decoys (see below). Sequence shuffling was used to test if a particular ar- rangement of residues had a significantly better fit to the structure than random sequences of the same com- position. The decoys, on the other hand, were used to evaluate if a sequence was significantly more compati- ble with a particular structure than with other compact (but misfolded) conformations (local energy minima). For the shuffled sequences, we calculated the Z-score (in the sense of Bowie et al. 1991) only with respect to the native structure as Zs = − ˉxs σs , (3) where is the PE of the native sequence, ˉxs is the aver- age PE of the shuffled sequences, and σs is the standard deviation of the PE scores of the shuffled sequences (all on the native structure). We further define Zd as the energy gap between a sequence on the native structure and its average PE on the set of misfolded decoys, expressed in standard deviations: Zd = − ˉxd σd . (4) Here, ˉxd denotes the average PE of a sequence on the set of decoys and σd is the standard deviation. Note that Zs at University of Pennsylvania Library on September 9, 2011 sysbio.oxfordjournals.org Downloaded from

Readership Statistics

26 Readers on Mendeley
by Discipline
 
 
by Academic Status
 
46% Ph.D. Student
 
27% Post Doc
 
8% Student (Master)
by Country
 
42% United States
 
15% Brazil
 
8% Germany

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in