A quantitative and qualitative characterization of K-mer based alignment-free phylogeny construction

Abstract

The rapidly growing volume of genomic data, including data from pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g. among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir. In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, in order to quantify the probability that a given alignment was no better than chance, and applied it to human papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences fell from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers. This indicates that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, allowing distances based on k-mer frequencies to serve as a proxy for evolutionary distance. We constructed phylogenies from primate mtDNA coding sequences and from Ebola samples. UPGMA phylogenies built from k-mers of the primate mtDNA coding region reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (p ≤ 1 × 10⁻⁵). We characterized differences in selection pressure between coding and non-coding regions, and between early and late cell-cycle genes in Ebola. The comparison of coding and non-coding regions showed evidence of purifying selection, while the early and late cell-cycle proteins differed, with late-cycle proteins showing a pattern resembling an influenza-like immunological response; notably, the g-proteins are among the late genes.
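To make the idea of a distance based on k-mer frequencies concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which distance measure was used, so the Euclidean distance here is an assumption, and the function names (`kmer_frequencies`, `kmer_distance`, `distance_matrix`) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers, using overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering so vectors from different sequences line up.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing symbols other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def kmer_distance(a, b, k=3):
    """Euclidean distance between k-mer frequency vectors (an assumed metric)."""
    fa = kmer_frequencies(a, k)
    fb = kmer_frequencies(b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def distance_matrix(seqs, k=3):
    """Pairwise distances for a dict mapping sample name to sequence."""
    names = list(seqs)
    return {(p, q): kmer_distance(seqs[p], seqs[q], k) for p in names for q in names}
```

A matrix produced this way could then be passed to a clustering method such as UPGMA, or compared with alignment-based phylogenetic distances using a Mantel test, as the abstract describes.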

Cite

Utro, F., Platt, D. E., & Parida, L. (2019). A quantitative and qualitative characterization of K-mer based alignment-free phylogeny construction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10834 LNBI, pp. 19–31). Springer Verlag. https://doi.org/10.1007/978-3-030-14160-8_3
