Optimal-time text indexing in BWT-runs bounded space

98Citations
Citations of this article
50Readers
Mendeley users who have this article in their library.

Abstract

Indexing highly repetitive texts | such as genomic databases, software repositories and versioned text collections | has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FMindex, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness | the number of phrases in the LempelZiv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors | have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FMindex so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ) within O(r log(n=r)) space, on a RAM machine with words of w = (log n) bits. Raising the space to O(rw logσ (n=r)), we support locate in O(mlog(σ)=w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n=r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n=r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.

Cite

CITATION STYLE

APA

Gagie, T., Navarro, G., & Prezza, N. (2018). Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1459–1477). Association for Computing Machinery. https://doi.org/10.1137/1.9781611975031.96

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free