Correction: Long-Branch Attraction Bias and Inconsistency in Bayesian Phylogenetics
Available from www.pubmedcentral.nih.gov
Long-Branch Attraction Bias and Inconsistency in Bayesian Phylogenetics

Bryan Kolaczkowski1¤, Joseph W. Thornton1,2*

1 Center for Ecology and Evolutionary Biology, University of Oregon, Eugene, Oregon, United States of America, 2 Howard Hughes Medical Institute, University of Oregon, Eugene, Oregon, United States of America

Abstract

Bayesian inference (BI) of phylogenetic relationships uses the same probabilistic models of evolution as its precursor maximum likelihood (ML), so BI has generally been assumed to share ML's desirable statistical properties, such as largely unbiased inference of topology given an accurate model and increasingly reliable inferences as the amount of data increases. Here we show that BI, unlike ML, is biased in favor of topologies that group long branches together, even when the true model and prior distributions of evolutionary parameters over a group of phylogenies are known. Using experimental simulation studies and numerical and mathematical analyses, we show that this bias becomes more severe as more data are analyzed, causing BI to infer an incorrect tree as the maximum a posteriori phylogeny with asymptotically high support as sequence length approaches infinity. BI's long branch attraction bias is relatively weak when the true model is simple but becomes pronounced when sequence sites evolve heterogeneously, even when this complexity is incorporated in the model. This bias (which is apparent under both controlled simulation conditions and in analyses of empirical sequence data) also makes BI less efficient and less robust to the use of an incorrect evolutionary model than ML. Surprisingly, BI's bias is caused by one of the method's stated advantages: that it incorporates uncertainty about branch lengths by integrating over a distribution of possible values instead of estimating them from the data, as ML does.
Our findings suggest that trees inferred using BI should be interpreted with caution and that ML may be a more reliable framework for modern phylogenetic analysis.

Citation: Kolaczkowski B, Thornton JW (2009) Long-Branch Attraction Bias and Inconsistency in Bayesian Phylogenetics. PLoS ONE 4(12): e7891. doi:10.1371/journal.pone.0007891

Editor: Wayne Delport, University of California San Diego, United States of America

Received September 8, 2009; Accepted October 12, 2009; Published December 9, 2009

Copyright: © 2009 Kolaczkowski, Thornton. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Supported by the Howard Hughes Medical Institute, National Science Foundation (DEB-0516530, IGERT DGE-9972830), National Institutes of Health (NIH R01-GM62351), and the Alfred P. Sloan Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: joet@uoregon.edu

¤ Current address: Biology Department, Dartmouth College, Hanover, New Hampshire, United States of America

Introduction

Statistical inference of phylogenetic relationships informs analysis in fields as diverse as comparative genomics, epidemiology, ecology, and evolution [1]. Bayesian inference (BI) of phylogeny [2–4] has recently gained in popularity and appears to have answered some long-standing phylogenetic questions [5,6]. The aim of Bayesian statistics is typically to characterize the posterior probability distribution of a set of hypotheses, given a body of data, a probabilistic model for the generation of that data, and an explicit probabilistic description of prior beliefs.
The chief concern of phylogenetics, in contrast, is to produce a concrete inference of historical evolutionary relationships and to characterize the statistical support for that inference. As such, nearly all phylogenetic analyses using BI have applied a Bayesian decision rule to select the tree with the highest posterior probability (or a consensus tree of all clades with posterior probability >0.5) as the best hypothesis of phylogeny (e.g., [5,6]).

BI and its precursor maximum likelihood (ML) infer phylogenetic relationships using the same probabilistic models of molecular evolution, so it has been assumed that BI, like ML [7–9], is largely unbiased and statistically consistent given the correct model [6,10]. A key difference between BI and ML, and a major proposed advantage of BI [3,10–12], is that Bayesian methods incorporate uncertainty about "nuisance parameters" such as branch lengths on the topology and the parameters of the evolutionary model; in contrast, ML requires specific values for these parameters to be estimated from the data. When data are limited, the ML estimates may deviate from the true values, because the observed state pattern frequencies vary stochastically from expectation. With larger datasets, ML yields increasingly accurate estimates of nuisance parameter values; as sequence length approaches infinity, the likelihood of the true phylogeny (with the correctly estimated branch lengths) is guaranteed to exceed that of any other phylogeny (with any branch lengths), so long as the model is adequately parameterized and identifiable [9,13]. In order to reduce dependence on estimates of nuisance parameters, BI calculates the integrated likelihood of each topology over multiple values of each parameter, weighted by a user-specified distribution that describes the prior probability of each parameter value [3].
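The contrast between the two approaches described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the paper: D is the sequence data, τ a topology, b its branch lengths, θ the remaining model parameters, and π a prior density.

```latex
\begin{align}
  % ML: score each topology by its likelihood at the best-fitting nuisance parameters
  L_{\mathrm{ML}}(\tau) &= \max_{b,\,\theta} \; P(D \mid \tau, b, \theta) \\
  % BI: score each topology by its likelihood averaged over the prior on nuisance parameters
  L_{\mathrm{BI}}(\tau) &= \int\!\!\int P(D \mid \tau, b, \theta)\, \pi(b, \theta)\, db\, d\theta \\
  % Posterior probability of a topology under BI
  P(\tau \mid D) &= \frac{L_{\mathrm{BI}}(\tau)\, \pi(\tau)}{\sum_{\tau'} L_{\mathrm{BI}}(\tau')\, \pi(\tau')}
\end{align}
```

The only difference between the first two expressions is whether the nuisance parameters are maximized out or integrated out, which is the difference the text identifies.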
Reliable prior information about branch lengths and other model parameters is seldom available in practice, so virtually all analyses have used "uninformative" diffuse prior distributions (such as branch length priors uniform from 0 to 5 or exponential with mean 0.1, which are offered as the default values in common software packages). Because BI incorporates uncertainty about nuisance parameters, it has been favored over ML for implementing complex models with many parameters, particularly when data are limited [3,10,12,14].

The statistical characteristics and performance of BI, particularly vis-a-vis ML, have not been thoroughly evaluated. Several criteria can be used to evaluate the reliability of phylogenetic methods for inferring topologies. First, the asymptotic performance of phylogenetic methods when the assumed evolutionary model is correct has been evaluated in terms of statistical consistency (convergence in probability on the true phylogeny, typically with increasing support, as the amount of sequence data increases). Consistency has been evaluated directly by mathematical proof [9,15–18] or numerical analysis [19,20], and indirectly by analyzing simulated datasets of increasing size [21–23]. Second, topological bias has been evaluated by determining whether a method tends to recover a particular incorrect topology when phylogenetic signal is absent or weak [7,19,24]. Third, efficiency (the quantity of data required to reliably recover the true tree) has typically been assessed by analyzing the proportion of correct inferences using simulated datasets of variable size [25–27]. Fourth, robustness to incorrect assumptions about the underlying evolutionary model or incorrect prior distributions (an important practical concern, because complete and accurate a priori knowledge of evolutionary processes is never available) has been evaluated by examining consistency, bias, and efficiency when the true model and prior distributions are not applied [8,23,28–32]. Other studies have examined the accuracy and behavior of measures of statistical confidence in topological inferences [29,32–40].
Most analyses of Bayesian phylogenetic methods have focused on the properties of their confidence measures; the consistency, bias, efficiency, and robustness of using BI with a Bayes decision rule to infer topologies have not been well characterized. "Bayesian simulations" have shown that, when the prior distributions precisely match the distribution of conditions under which the data were simulated, the average posterior probability of a group of inferences accurately predicts the proportion of those inferences that are correct [29,31]. Yang and Rannala [31] showed that the choice of priors affects posterior probabilities and that vague or uninformative priors can cause them to deviate from the fraction of correct inferences, but they did not investigate whether the deviation was structured to favor certain topologies. Kolaczkowski and Thornton [32] found that the direction of this deviation in posterior probabilities depends on the pattern of branch lengths on the tree; when the true tree has non-sister long branches, the posterior probability of the incorrect long branch attraction (LBA) tree tends to be inflated. Susko [41] analyzed the distribution of posterior probabilities in the limiting case of sequence length approaching infinity and found that sequences generated on an unresolved four-taxon star tree with two long branches yield posterior probabilities that favor the resolved LBA tree. Taken together, these studies establish that the choice of prior distribution affects posterior probabilities and suggest that under some simple conditions BI might exhibit topological bias.

Many questions remain open, however. First, it is not clear whether BI using a Bayesian decision rule is significantly biased when finite data are analyzed, when the true tree is resolved, or when sequences generated under realistic conditions are analyzed.
Second, it is unclear why BI might be biased in favor of certain topologies as data increases, particularly because the effects of prior assumptions are expected to diminish as the quantity of data increases. Third, the possibility that Bayesian simulations (in which results are summarized over a range of evolutionary conditions) might mask bias under specific conditions has not been examined. Finally, the relative accuracy, efficiency, and robustness of BI compared to ML have not been evaluated.

BI and ML implementations employ different search strategies and different estimates of statistical confidence, so direct comparison of phylogenetic accuracy using these two frameworks has not been possible. To address this issue, we implemented a novel "empirical Bayes" [42] method, which uses the same Markov-chain Monte Carlo (MCMC) sampling strategy as traditional BI but calculates the posterior probability of each tree assuming the ML estimate of branch lengths and other parameters (Fig. S1). Although posterior probabilities are not a meaningful concept in a strict ML framework, our empirical Bayes approach produces inferences identical to those generated by ML: given uniform prior probability for each topology and an adequate search, the tree with the highest posterior probability using our method will always be the ML tree. BI differs from our ML/empirical Bayes method only by integrating over branch lengths and other model parameters, allowing us to directly compare the performance of ML to BI and to specifically determine the effects of incorporating parameter uncertainty on phylogenetic accuracy. We analyzed both simulated and empirical data under a range of controlled conditions using both BI and this novel ML implementation.
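The scoring step of the empirical Bayes idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the maximized log-likelihood of every candidate topology is already available (the paper samples topologies by MCMC instead), and the function name and the example values are hypothetical.

```python
import math


def topology_posteriors(max_log_likelihoods, log_priors=None):
    """Normalize per-topology maximized log-likelihoods into posterior probabilities.

    Each entry of max_log_likelihoods is assumed to be the log-likelihood of one
    topology evaluated at its ML branch lengths and model parameters.
    """
    n = len(max_log_likelihoods)
    if log_priors is None:
        # Uniform prior over topologies, as assumed in the text.
        log_priors = [-math.log(n)] * n
    scores = [ll + lp for ll, lp in zip(max_log_likelihoods, log_priors)]
    # Log-sum-exp keeps the normalization stable for very negative log-likelihoods.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


# Hypothetical maximized log-likelihoods for the three unrooted four-taxon topologies.
example_log_likelihoods = [-2501.3, -2503.9, -2504.4]
posteriors = topology_posteriors(example_log_likelihoods)
best_index = max(range(len(posteriors)), key=posteriors.__getitem__)
```

With a uniform topology prior, the normalization is monotonic in the maximized likelihood, so the highest-scoring topology is the ML tree, which matches the equivalence the text states.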
The results, together with numerical and mathematical analyses, indicate that integrating over uncertainty about branch lengths induces an intractable topological bias in BI that results in reduced accuracy, efficiency, and robustness compared to ML; they also suggest that BI is likely to be statistically inconsistent. Although in practice BI and ML will recover the same phylogeny across a wide range of conditions, our findings indicate that when the two methods differ in their results, ML is more likely to be accurate.

Results

Long Branch Attraction Bias

We first evaluated whether incorporating parameter uncertainty using BI as commonly practiced causes topological bias under simple but challenging evolutionary conditions [19]. We simulated sequences using a simple model of nucleotide evolution along a four-taxon star tree with two long and two short terminal branch lengths (Fig. 1a). When data were analyzed using the correct evolutionary model, ML was unbiased, recovering each possible tree with equal frequency; the mean posterior probability for each tree was ~1/3 at all sequence lengths, as expected for an unbiased method [24]. In contrast, BI, using the common assumption of uniform priors over branch lengths, inferred as the maximum a posteriori tree the falsely resolved topology that pairs long branches together in over 70% of replicates, with mean posterior probability ~0.6, when sequences were of moderate length. This long branch attraction (LBA) bias grew stronger with increasing sequence length, as indicated by a positive slope of the best-fit regression curve (P = 0.03). BI's bias is not restricted to star-tree conditions but affects phylogenetic accuracy on resolved trees, as well (Fig. 1a). Under simple evolutionary conditions, BI required a 25% longer internal branch than ML to recover the correct phylogeny with 95% frequency (Table S1).
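The star-tree simulation described at the start of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the text says only "a simple model of nucleotide evolution", so the Jukes-Cantor (JC69) model is assumed here, and the branch lengths, taxon names, and function names are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc69_change_probability(branch_length):
    """Probability that a site's end state differs from its start state under JC69.

    branch_length is in expected substitutions per site.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


def evolve(ancestor, branch_length, rng):
    """Evolve a sequence along one branch, treating sites independently."""
    p_change = jc69_change_probability(branch_length)
    descendant = []
    for base in ancestor:
        if rng.random() < p_change:
            # Under JC69 the three alternative states are equally likely.
            base = rng.choice([n for n in NUCLEOTIDES if n != base])
        descendant.append(base)
    return "".join(descendant)


def simulate_star_tree(branch_lengths, n_sites, seed=0):
    """Simulate one alignment on a star tree: every taxon descends directly from the root."""
    rng = random.Random(seed)
    root = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
    return {taxon: evolve(root, length, rng) for taxon, length in branch_lengths.items()}


# Hypothetical branch lengths: taxa A and C are long, B and D are short.
alignment = simulate_star_tree({"A": 0.75, "B": 0.05, "C": 0.75, "D": 0.05}, n_sites=1000)
```

Because the tree has no internal branch, the data carry no signal for any resolved topology, so an unbiased method should prefer each of the three resolutions about equally often across replicates.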
These results indicate that BI suffers from long branch attraction bias and that this bias is caused by integrating over branch lengths. They also establish that, under these conditions, BI is less efficient than ML at recovering the true topology.

We conducted similar analyses using both nucleotide and amino acid data, various prior distributions, and a range of complex and simple evolutionary models. In all cases, BI, unlike ML, displayed LBA bias, which grew worse with increasing data (Figs.