Fast phylogenetic biodiversity computations under a non-uniform random distribution

Abstract

Computing the phylogenetic diversity of a set of species is an important part of many ecological case studies. More specifically, let T be a phylogenetic tree, and let R be a subset of its leaves representing the species under study. Specialists in ecology want to evaluate a function f(T,R) (a phylogenetic measure) that quantifies the evolutionary distance between the elements in R. In most applications, however, it is also important to examine how f(T,R) behaves when R is selected at random. The standard way to do this is to compute the mean and the variance of f among all subsets of leaves in T that consist of exactly |R| = r elements. For certain measures, there exist algorithms that can compute these statistics, under the condition that all subsets of r leaves are equiprobable. Yet, so far there are no algorithms that can do this exactly when the leaves in T are weighted with unequal probabilities. As a consequence, for this general setting, specialists try to compute the statistics of phylogenetic measures using methods that are both inexact and very slow. We present for the first time exact and efficient algorithms for computing the mean and the variance of phylogenetic measures when leaf subsets of fixed size are selected from T under a non-uniform random distribution. In particular, let T be a tree that has n nodes and depth d, and let r be a non-negative integer. We show how to compute in O((d + log n)n log n) time and O(n) space the mean and the variance for any measure that belongs to a well-defined class. We show that two of the most popular phylogenetic measures belong to this class: the Phylogenetic Diversity (PD) and the Mean Pairwise Distance (MPD). The random distribution that we consider is the Poisson binomial distribution restricted to subsets of fixed size r.
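To make the quantities above concrete, the following is a minimal brute-force sketch, not the authors' algorithm. It enumerates every leaf subset of size r and takes exponential time, so it is useful only as a reference on tiny trees. It assumes that the restricted Poisson binomial distribution weights a subset R by the product of p_i over leaves in R and (1 - p_j) over leaves outside R, normalized over all subsets of size r; it also assumes the rooted variant of PD (total length of the edges on the paths from the root to the leaves in R). The toy tree, the probabilities, and all function names are hypothetical.

```python
from itertools import combinations
from math import prod

# Hypothetical toy tree: child -> (parent, length of the edge to the parent).
# The node "root" has no entry because it has no parent edge.
TREE = {
    "u": ("root", 1.0),
    "a": ("u", 2.0),
    "b": ("u", 3.0),
    "c": ("root", 4.0),
}
# Hypothetical per-leaf selection probabilities (non-uniform).
LEAF_PROB = {"a": 0.5, "b": 0.2, "c": 0.8}


def edges_to_root(node):
    """Edges on the path from node to the root, each named by its child end."""
    edges = set()
    while node in TREE:
        edges.add(node)
        node = TREE[node][0]
    return edges


def pd(subset):
    """Rooted PD (assumed variant): total length of edges from the root to the subset."""
    edges = set().union(*(edges_to_root(leaf) for leaf in subset))
    return sum(TREE[e][1] for e in edges)


def distance(x, y):
    """Path length between two leaves: edges on exactly one of the two root paths."""
    return sum(TREE[e][1] for e in edges_to_root(x) ^ edges_to_root(y))


def mpd(subset):
    """Mean pairwise distance over unordered pairs of distinct leaves (needs r >= 2)."""
    pairs = list(combinations(subset, 2))
    return sum(distance(x, y) for x, y in pairs) / len(pairs)


def subset_weight(subset):
    """Unnormalized probability of selecting exactly this subset of leaves."""
    chosen = set(subset)
    return prod(p if leaf in chosen else 1.0 - p for leaf, p in LEAF_PROB.items())


def mean_and_variance(measure, r):
    """Exact mean and variance of a measure over all size-r leaf subsets."""
    subsets = list(combinations(sorted(LEAF_PROB), r))
    weights = [subset_weight(s) for s in subsets]
    total = sum(weights)  # normalizing constant for subsets of size r
    values = [measure(s) for s in subsets]
    mean = sum(w * v for w, v in zip(weights, values)) / total
    second_moment = sum(w * v * v for w, v in zip(weights, values)) / total
    return mean, second_moment - mean ** 2
```

Calling `mean_and_variance(pd, 2)` or `mean_and_variance(mpd, 2)` returns the two statistics for subsets of two leaves. A reference like this is mainly useful for checking a faster implementation on small inputs.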
More than that, we provide a stronger result: specifically for the PD and the MPD, we describe algorithms that compute in a batched manner the mean and variance on T for all possible leaf-subset sizes in O((d + log n)n log n) time and O(n) space. For the PD and the MPD, we implemented our algorithms that perform batched computations of the mean and variance. We also developed alternative implementations that compute the same output in O((d + log n)n^2) time. For both types of implementations, we conducted experiments and measured their performance in practice. Despite the difference in theoretical performance, we show that the algorithms that run in O((d + log n)n^2) time are more efficient in practice and numerically more stable. We also compared the performance of these algorithms with standard inexact methods that can be used in case studies. We show that our algorithms are far faster, making it possible to process much larger datasets than before. Our implementations will become publicly available through the R package PhyloMeasures.
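One quantity that any computation over all subset sizes needs is the normalizing constant for each size r. Under the assumption that a subset is weighted by the product of p_i over selected leaves and (1 - p_j) over the rest, this constant is the ordinary Poisson binomial probability that exactly r leaves are selected. The sketch below computes it for every r at once with a standard quadratic-time dynamic program. It is an illustration of that single ingredient, not the batched algorithm described in the abstract, and the function name is hypothetical.

```python
def size_distribution(probs):
    """Return a list d where d[r] is the probability that exactly r leaves are
    selected when leaf i is included independently with probability probs[i].

    Standard O(n^2) dynamic program for the Poisson binomial distribution.
    """
    dist = [1.0]  # with no leaves processed, size 0 has probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # current leaf not selected
            nxt[k + 1] += mass * p      # current leaf selected
        dist = nxt
    return dist
```

For example, `size_distribution([0.5, 0.2, 0.8])[2]` gives the total unnormalized weight of all two-leaf subsets of three leaves with those probabilities, which is the denominator used when conditioning on r = 2.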

Tsirogiannis, C., & Sandel, B. (2016). Fast phylogenetic biodiversity computations under a non-uniform random distribution. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9649, pp. 225–236). Springer Verlag. https://doi.org/10.1007/978-3-319-31957-5_16