Distributed string mining for high-throughput sequencing data

Niko Välimäki; Simon J. Puglisi

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7534 LNBI 441-452

DOI: 10.1007/978-3-642-33122-0_35

Abstract

The goal of frequency-constrained string mining is to extract substrings that discriminate two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to various time-space tradeoffs. However, all of the existing algorithms have been designed to run sequentially and require that the whole input fit in main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that has a novel time-space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g., Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure. © 2012 Springer-Verlag.
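
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained string mining on two small datasets. It is not the distributed algorithm of the paper. The function names, the thresholds `min_pos` and `max_neg`, and the length cap `max_len` are illustrative assumptions. It also assumes the common formulation in which a substring's frequency is the fraction of strings in a dataset that contain it.

```python
from collections import Counter


def substring_support(dataset, max_len):
    """Count, for each substring of length <= max_len, how many strings contain it."""
    support = Counter()
    for s in dataset:
        seen = set()  # each string contributes at most once per substring
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                seen.add(s[i:j])
        support.update(seen)
    return support


def discriminating_substrings(pos, neg, min_pos, max_neg, max_len=8):
    """Return substrings frequent in `pos` (>= min_pos) and rare in `neg` (<= max_neg).

    Hypothetical helper for illustration only: naive enumeration that
    holds everything in memory, so it is suitable only for tiny inputs.
    """
    pos_support = substring_support(pos, max_len)
    neg_support = substring_support(neg, max_len)
    return sorted(
        sub
        for sub, count in pos_support.items()
        if count / len(pos) >= min_pos
        and neg_support[sub] / len(neg) <= max_neg
    )


if __name__ == "__main__":
    reads_a = ["ACGT", "ACGA"]
    reads_b = ["TTGA", "CGTT"]
    print(discriminating_substrings(reads_a, reads_b, 1.0, 0.0, max_len=3))
```

Because this naive enumeration keeps every candidate substring in memory, it illustrates why sequential, in-memory approaches stop being practical as inputs grow.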

Cite

Välimäki, N., & Puglisi, S. J. (2012). Distributed string mining for high-throughput sequencing data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7534 LNBI, pp. 441–452). https://doi.org/10.1007/978-3-642-33122-0_35
