Decontaminating eukaryotic genome assemblies with machine learning

Janna L. Fierst; Duncan A. Murdock

Journal Article, Open Access

BMC Bioinformatics (2017) 18(1)

DOI: 10.1186/s12859-017-1941-0

Abstract

Background: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences, but in practice DNA extracts are often contaminated with sequences from other organisms. Few methods currently exist for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants and risk eliminating sequences from the target organism.

Results: We introduce a novel application of an established machine learning method, a decision tree, that can rigorously classify sequences. The major strength of the decision tree is that it can take any measured feature as input and does not require a priori identification of significant descriptors. We use the decision tree to classify de novo assembled sequences and compare the method to published protocols.

Conclusions: A decision tree performs better than existing methods when classifying sequences in eukaryotic de novo assemblies. It is efficient, readily implemented, and accurately identifies target and contaminant sequences. Importantly, a decision tree can be used to classify sequences according to measured descriptors and has many potential uses in distilling biological datasets.
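The classification idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which features were measured, so the two used here (GC fraction and mean read coverage per contig) are assumptions, the training values are synthetic, and the function names are hypothetical. The sketch grows a small decision tree by choosing, at each node, the feature threshold that most reduces Gini impurity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left],
                   depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right],
                   depth + 1, max_depth),
    )

def classify(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Synthetic training contigs: (GC fraction, mean coverage). Illustrative only.
train_rows = [
    (0.36, 95.0), (0.38, 88.0), (0.40, 102.0), (0.37, 91.0),
    (0.62, 12.0), (0.65, 9.0), (0.60, 15.0), (0.66, 11.0),
]
train_labels = ["target"] * 4 + ["contaminant"] * 4

tree = build_tree(train_rows, train_labels)
print(classify(tree, (0.39, 90.0)))
print(classify(tree, (0.63, 10.0)))
```

Because the tree only compares feature values against thresholds, any other per-contig measurement could be appended to each row without changing the code, which is the property the abstract highlights.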

Cite

Fierst, J. L., & Murdock, D. A. (2017). Decontaminating eukaryotic genome assemblies with machine learning. BMC Bioinformatics, 18(1). https://doi.org/10.1186/s12859-017-1941-0
