In this tutorial, through a series of analytical computations and numerical simulations, we review some known insights into a fundamental question: how much data is needed to reconstruct the Tree of Life? A Jupyter notebook and code for this tutorial are provided in Python.

Phylogeny estimation is a central problem in evolutionary biology and beyond [29]. In the most basic form of the problem, one has access to aligned homologous DNA sequences, say from a common gene, across multiple species. The goal is to output a phylogeny that describes the underlying evolutionary relationships. A large number of inference methods have been developed for this problem [31]. Often one relies on the assumption that the data fits a stochastic model of sequence evolution on a tree, under which many methods have been proven to be statistically consistent, i.e., as the amount of data increases, the estimated phylogeny converges to the true phylogeny with probability one. To compare the statistical accuracy of different methods, however, a natural theoretical approach is to analyze the rate at which this convergence occurs.

After some basic definitions, we analyze in detail a simple setting: the three-leaf rooted case under the Cavender-Farris model. Despite its simplicity, this setting already brings to light the important role played by various parameters, in particular the shortest branch length and the depth, in the difficulty of reconstructing phylogenies. We consider both distance-based and likelihood-based methods.
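To make the setting concrete, the following is a minimal sketch (not the tutorial's own notebook code) of distance-based reconstruction in the three-leaf case under the Cavender-Farris model. Each site is a binary state that flips along an edge of length t with probability p = (1 - e^{-2t})/2, and the standard distance estimator d = -(1/2) ln(1 - 2*p_hat) is additive along paths, so the pair of leaves with the smallest estimated distance identifies the topology. The tree shape and branch lengths below are illustrative choices, not values from the text.

```python
import math
import random

def simulate_cfn_edge(seq, p):
    # Evolve a binary sequence along one edge: each site flips
    # independently with probability p (Cavender-Farris model).
    return [1 - s if random.random() < p else s for s in seq]

def p_from_t(t):
    # Flip probability induced by an edge of length t.
    return (1 - math.exp(-2 * t)) / 2

def cfn_distance(seq_a, seq_b):
    # Distance estimator d = -(1/2) ln(1 - 2*p_hat), where p_hat is
    # the fraction of mismatched sites; additive along tree paths.
    k = len(seq_a)
    p_hat = sum(a != b for a, b in zip(seq_a, seq_b)) / k
    return -0.5 * math.log(1 - 2 * p_hat)

# Illustrative three-leaf rooted tree: the root leads to leaf C
# (length 0.1) and to an internal node (length 0.05), which in turn
# leads to leaves A and B (length 0.1 each).
random.seed(0)
k = 100_000  # sequence length (number of sites)
root = [random.randint(0, 1) for _ in range(k)]
internal = simulate_cfn_edge(root, p_from_t(0.05))
A = simulate_cfn_edge(internal, p_from_t(0.1))
B = simulate_cfn_edge(internal, p_from_t(0.1))
C = simulate_cfn_edge(root, p_from_t(0.1))

# True path lengths: d(A,B) = 0.2, d(A,C) = d(B,C) = 0.25, so the
# estimates should single out {A, B} as the closest pair.
print(cfn_distance(A, B), cfn_distance(A, C), cfn_distance(B, C))
```

Note how the difficulty discussed in the text shows up here: the topology is decided by a gap of only 0.05 between the competing distances (twice the internal branch length), so as that branch shrinks, the sequence length k needed to resolve the pair reliably grows.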
Roch, S. (2019). Hands-on Introduction to Sequence-Length Requirements in Phylogenetics (pp. 47–86). https://doi.org/10.1007/978-3-030-10837-3_4