RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

Daniel J. Nasko; Sergey Koren; Adam M. Phillippy; Todd J. Treangen

Journal Article, Open Access

Genome Biology (2018) 19(1)

DOI: 10.1186/s13059-018-1554-6

Abstract

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
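The effect described in the second finding can be illustrated with a toy model. The following is a minimal sketch, not the authors' method or any specific tool's implementation: the taxonomy, the genome strings, the function names, and the simple majority-vote classification rule are all assumptions made for illustration (real classifiers use more elaborate scoring). It shows how adding a second species from the same genus moves shared k-mers up to the genus-level lowest common ancestor, so the same read is still classified, but no longer at the species level.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT = {
    "root": None,
    "genus_A": "root",
    "species_A1": "genus_A",
    "species_A2": "genus_A",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(genomes, k):
    """Map each k-mer to the LCA of every taxon whose genome contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in set(kmers(seq, k)):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db


def classify(read, db, k):
    """Assign the read to the taxon hit by most of its k-mers (simplified rule)."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None  # unclassified
    return hits.most_common(1)[0][0]


K = 5
READ = "ACGTACGG"

# "Older" database: a single species from the genus.
old_db = build_db({"species_A1": "ACGTACGGTCA"}, K)

# "Newer" database: a closely related second species has been added.
new_db = build_db(
    {"species_A1": "ACGTACGGTCA", "species_A2": "ACGTACGGTTT"}, K
)

old_call = classify(READ, old_db, K)
new_call = classify(READ, new_db, K)
print(old_call, new_call)
```

In this toy setup every k-mer of the read is unique to `species_A1` in the older database, but shared by both species in the newer one, so the call moves from the species to `genus_A`.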

Citation (APA)

Nasko, D. J., Koren, S., Phillippy, A. M., & Treangen, T. J. (2018). RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biology, 19(1). https://doi.org/10.1186/s13059-018-1554-6
