emMAW: Computing minimal absent words in external memory

10Citations
Citations of this article
14Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Motivation: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes. There exists an OðnÞ-time and OðnÞ-space algorithm for computing all minimal absent words of a sequence of length n on a fixed-sized alphabet based on suffix arrays. A standard implementation of this algorithm, when applied to a large sequence of length n, requires more than 20n bytes of RAM. Such memory requirements are a significant hurdle to the computation of minimal absent words in large datasets. Results: We present emMAW, the first external-memory algorithm for computing minimal absent words. A free open-source implementation of our algorithm is made available. This allows for computation of minimal absent words on far bigger data sets than was previously possible. Our implementation requires less than 3 h on a standard workstation to process the full human genome when as little as 1 GB of RAM is made available. We stress that our implementation, despite making use of external memory, is fast; indeed, even on relatively smaller datasets when enough RAM is available to hold all necessary data structures, it is less than two times slower than state-of-the-art internal-memory implementations.

Cite

CITATION STYLE

APA

Héliou, A., Pissis, S. P., & Puglisi, S. J. (2017). emMAW: Computing minimal absent words in external memory. Bioinformatics, 33(17), 2746–2749. https://doi.org/10.1093/bioinformatics/btx209

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free