Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

Abstract

Background: With the rapid growth in the number of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large-scale molecular data. One of the foremost is developing efficient methods that can handle missing data. Popular distance-based methods, such as neighbor joining (NJ) and the unweighted pair group method with arithmetic mean (UPGMA), require complete distance matrices with no missing entries.

Results: We introduce two highly accurate machine learning based distance imputation techniques, one built on matrix factorization and the other on an autoencoder-based deep learning architecture. We evaluated both methods on a collection of simulated and biological datasets. Experimental results suggest that the proposed methods match or improve upon the best alternative distance imputation techniques. Moreover, these methods scale to large datasets with hundreds of taxa and can handle a substantial amount of missing data.

Conclusions: This study shows, for the first time, the power and feasibility of applying deep learning techniques to impute distance matrices. It thus advances the state of the art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.
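To make the matrix factorization idea concrete, the following is a minimal sketch of how missing entries in a distance matrix might be filled in. It is not the authors' implementation (see the repository linked above for that). The function name `impute_distances`, the hyperparameters (`rank`, `steps`, `lr`, `reg`) and the toy five-taxon matrix are all assumptions made for illustration. The sketch fits a low-rank product U Vᵀ to the observed entries by gradient descent and uses that product to estimate the missing ones.

```python
import numpy as np


def impute_distances(D, rank=2, steps=5000, lr=0.01, reg=1e-3, seed=0):
    """Fill NaN entries of a symmetric distance matrix with a low-rank estimate.

    Illustrative sketch only: plain gradient descent on the squared error
    over observed entries, with a small L2 penalty on both factors.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    mask = ~np.isnan(D)                  # True where a distance is observed
    scale = np.nanmax(D)                 # rescale so observed values lie in [0, 1]
    X = np.where(mask, D / scale, 0.0)   # missing cells are zeroed but masked out below

    rng = np.random.default_rng(seed)
    U = rng.uniform(0.1, 1.0, size=(n, rank))
    V = rng.uniform(0.1, 1.0, size=(n, rank))

    for _ in range(steps):
        R = mask * (U @ V.T - X)         # residual on observed entries only
        grad_U = R @ V + reg * U
        grad_V = R.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V

    est = (U @ V.T) * scale
    est = (est + est.T) / 2.0            # distances are symmetric
    out = np.where(mask, D, est)         # keep observed values untouched
    np.fill_diagonal(out, 0.0)           # distance from a taxon to itself is zero
    return out


# Toy example (hypothetical data): five taxa placed on a line, one pair missing.
positions = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
D = np.abs(positions[:, None] - positions[None, :])
D[0, 3] = D[3, 0] = np.nan
completed = impute_distances(D)
print(np.round(completed, 2))
```

The completed matrix has no missing entries, so it can be passed to any distance-based tree builder such as NJ or UPGMA. How close the imputed values come to the true distances depends on the chosen rank, the amount of missing data and the optimizer settings, none of which are tuned here.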

Citation

Bhattacharjee, A., & Bayzid, M. S. (2020). Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. BMC Genomics, 21(1). https://doi.org/10.1186/s12864-020-06892-5
