Fast, Small, and Simple Document Listing on Repetitive Text Collections

Dustin Cobas; Gonzalo Navarro

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019), Vol. 11811 LNCS, pp. 482–498

DOI: 10.1007/978-3-030-32686-9_34

Abstract

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers for the nonterminals, and we store those answers in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
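To make the terms in the abstract concrete, the following is a minimal Python sketch of a document array and of document listing over it. It is a naive, uncompressed baseline for illustration only, not the grammar-compressed index the paper proposes. The function names `build_document_array` and `list_documents`, the `"\x00"` separator, and the toy collection are assumptions introduced here, and the pattern is assumed not to contain the separator.

```python
from bisect import bisect_left, bisect_right


def build_document_array(docs):
    """Return (text, suffix_array, document_array) for a list of strings.

    Assumption: no document contains the separator "\x00".
    """
    # Concatenate the documents, ending each one with a separator that
    # sorts before every ordinary symbol.
    text = "".join(d + "\x00" for d in docs)

    # doc_of[i] is the identifier of the document owning text position i.
    doc_of = []
    for doc_id, d in enumerate(docs):
        doc_of.extend([doc_id] * (len(d) + 1))

    # Naive suffix array: sort suffix start positions by the suffix itself.
    # This is quadratic or worse; real indexes use dedicated constructions.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    # Document array: the suffix array coarsened to document identifiers.
    da = [doc_of[i] for i in sa]
    return text, sa, da


def list_documents(text, sa, da, pattern):
    """List the distinct documents in which `pattern` occurs."""
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them in sorted order,
    # so the suffixes starting with the pattern form a contiguous range.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Every occurrence contributes its document; keep each one once.
    return sorted(set(da[lo:hi]))


if __name__ == "__main__":
    docs = ["abracadabra", "abracadabro", "cadabra"]
    text, sa, da = build_document_array(docs)
    print(list_documents(text, sa, da, "abrac"))
    print(list_documents(text, sa, da, "bro"))
```

This baseline scans every occurrence in the suffix array range, so its cost grows with the number of occurrences rather than with the number of distinct documents reported, and it stores the document array uncompressed. Those are the two costs the index described in the abstract is designed to avoid.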

Cobas, D., & Navarro, G. (2019). Fast, Small, and Simple Document Listing on Repetitive Text Collections. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11811 LNCS, pp. 482–498). Springer. https://doi.org/10.1007/978-3-030-32686-9_34
