Self-indexing based on LZ77

Sebastian Kreft; Gonzalo Navarro

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6661 LNCS, pp. 41-54 (2011)

DOI: 10.1007/978-3-642-21458-5_6

Abstract

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases. © 2011 Springer-Verlag.
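To illustrate why highly repetitive collections are so compressible under LZ77, here is a minimal sketch of a greedy LZ77-style parse. It is not the self-index described in the abstract, and not necessarily the exact parsing variant the authors use: the function names `lz77_parse` and `lz77_decode`, the (source, length, next character) phrase layout, and the brute-force search are all assumptions made for illustration. The point is only that a repetitive text breaks into very few phrases, which is the source of compressibility that an LZ77-based index can exploit.

```python
def lz77_parse(text):
    """Greedy LZ77-style parse (illustrative, brute force).

    Each phrase is (source_pos, copy_length, next_char): copy `copy_length`
    characters starting at `source_pos` in the text seen so far, then
    append the explicit character `next_char`.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, 0
        # Try every earlier starting position; the copy may overlap position i.
        for src in range(i):
            length = 0
            # Stop one character early so a trailing explicit character exists.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        # Copy one character at a time so self-overlapping copies work.
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "ACGT" * 100          # stands in for a highly repetitive collection
    phrases = lz77_parse(repetitive)
    print(len(repetitive), "characters parsed into", len(phrases), "phrases")
    assert lz77_decode(phrases) == repetitive
```

The quadratic search above is only suitable for tiny inputs; practical LZ77 parsers for large collections rely on suffix-based data structures instead.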

Cite (APA)

Kreft, S., & Navarro, G. (2011). Self-indexing based on LZ77. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6661 LNCS, pp. 41–54). https://doi.org/10.1007/978-3-642-21458-5_6
