Succincter text indexing with wildcards

Chris Thachuk

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011) 6661 LNCS 27-40

DOI: 10.1007/978-3-642-21458-5_5

Abstract

We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giving a succinct index requiring (2 + o(1))n log σ + O(n) + O(d log n) + O(k log k) bits for a text of length n over an alphabet of size σ containing d groups of k wildcards. The new index is particularly favourable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying O(n) bits to also support efficient dictionary matching queries. We present a new query algorithm for our wildcard index that greatly reduces the query working space to O(dm + m log n) bits, where m is the length of the query. We note that, compared to previous results, this reduces the working space by two orders of magnitude when aligning short read data to the human genome. © 2011 Springer-Verlag.
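To make the problem statement concrete, the following is a minimal sketch of what matching against a text with wildcard positions means. It is a naive quadratic scan, not the succinct index described in the abstract. The wildcard symbol `*`, the function names, and the reading of "d groups of k wildcards" as d maximal runs containing k wildcards in total are assumptions made for illustration.

```python
WILDCARD = "*"  # assumed wildcard symbol; the abstract does not fix a notation


def find_matches(text: str, pattern: str) -> list[int]:
    """Return every start position where pattern aligns with text.

    A wildcard position in the text matches any pattern character.
    This is a brute-force O(n * m) scan, shown only to define the query.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


def wildcard_groups(text: str) -> tuple[int, int]:
    """Return (d, k): the number of maximal wildcard runs and the total wildcard count."""
    d = k = 0
    prev_wild = False
    for ch in text:
        wild = ch == WILDCARD
        if wild:
            k += 1
            if not prev_wild:
                d += 1  # a new run of wildcards starts here
        prev_wild = wild
    return d, k


if __name__ == "__main__":
    genome = "ACG*TA**C"  # toy text with SNP positions modeled as wildcards
    print(find_matches(genome, "GAT"))
    print(wildcard_groups(genome))
```

In this toy text the read `GAT` aligns at position 2 only because the SNP position is treated as a wildcard, which is the kind of alignment the index is designed to report without scanning the whole text.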

Cite

Thachuk, C. (2011). Succincter text indexing with wildcards. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6661 LNCS, pp. 27–40). https://doi.org/10.1007/978-3-642-21458-5_5
