Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays

Thomas D. Wu

Journal Article, Open Access

Algorithms for Molecular Biology (2016) 11(1)

DOI: 10.1186/S13015-016-0068-6

Abstract

Background: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements.

Results: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search.

Conclusions: The combination of exception guide arrays, a discriminating character array, and integrated data storage provides a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20% faster and 40% more space-efficient than an uncompressed ESA.
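As a rough illustration of the bytecoding idea mentioned in the abstract, the following Python sketch stores each integer in one byte when it fits and moves larger values into a separate exception table, with a small guide array that narrows the exception lookup to one block. This is a minimal sketch under stated assumptions, not the article's implementation: the escape value `ESCAPE`, the block size `BLOCK`, the array layout, and the function names `encode` and `lookup` are all hypothetical choices made for the example.

```python
from bisect import bisect_left

ESCAPE = 255  # assumed byte value meaning "look in the exception table"
BLOCK = 4     # hypothetical guide interval; a real index would use a larger block


def encode(values):
    """Pack non-negative integers into one byte each, spilling large ones."""
    packed = bytearray()
    exc_pos, exc_val = [], []  # parallel arrays, sorted by position
    guide = []                 # guide[b] = number of exceptions before block b
    for i, v in enumerate(values):
        if i % BLOCK == 0:
            guide.append(len(exc_pos))
        if v < ESCAPE:
            packed.append(v)
        else:
            packed.append(ESCAPE)
            exc_pos.append(i)
            exc_val.append(v)
    guide.append(len(exc_pos))  # sentinel that closes the last block
    return packed, exc_pos, exc_val, guide


def lookup(i, packed, exc_pos, exc_val, guide):
    """Return the original value at position i."""
    b = packed[i]
    if b != ESCAPE:
        return b  # common case: a single byte read
    # Exception: search only the slice of exceptions belonging to i's block.
    lo, hi = guide[i // BLOCK], guide[i // BLOCK + 1]
    k = bisect_left(exc_pos, i, lo, hi)
    return exc_val[k]


values = [3, 0, 300, 7, 254, 255, 1, 100000, 2]
packed, exc_pos, exc_val, guide = encode(values)
decoded = [lookup(i, packed, exc_pos, exc_val, guide) for i in range(len(values))]
```

In this sketch the common case costs one byte read, and the guide array bounds the binary search to the exceptions that fall inside a single block rather than the whole exception table.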

Citation (APA)

Wu, T. D. (2016). Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays. Algorithms for Molecular Biology, 11(1). https://doi.org/10.1186/S13015-016-0068-6
