Compact universal k-mer hitting sets

19Citations
Citations of this article
23Readers
Mendeley users who have this article in their library.
Get full text

Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and by applying them to two bacterial genomes show that universal hitting k-mers improve on minimizers. The software and exemplary sets are freely available at acgt.cs.tau.ac.il/docks/.

Cite

CITATION STYLE

APA

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., & Kingsford, C. (2016). Compact universal k-mer hitting sets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9838 LNCS, pp. 257–268). Springer Verlag. https://doi.org/10.1007/978-3-319-43681-4_21

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free