Disk-based compression of data from genome sequencing

Szymon Grabowski; Sebastian Deorowicz; Lukasz Roguski

Journal Article, Open Access

Bioinformatics (2015) 31(9) 1389-1395

DOI: 10.1093/bioinformatics/btu844

Abstract

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based; the better of the two existing ones, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage.

Results: We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to obtain a compression ratio of 0.317 bits per base, which fits the 134.0 Gbp dataset into only 5.31 GB of space.
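The abstract does not spell out how minimizers are used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the simplest definition of a minimizer (the lexicographically smallest k-mer of a read) and groups reads that share one into the same bucket, so that overlapping reads tend to land together and each bucket can be processed independently. The function names, the value of `k`, and the toy reads are all hypothetical. A small helper also checks the size arithmetic quoted in the abstract.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `read`.

    Assumption: plain lexicographic order on the forward strand only;
    the published method may use a different ordering or extra rules.
    """
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by their minimizer (hypothetical helper)."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def compressed_size_gb(bases: float, bits_per_base: float) -> float:
    """Convert a bits-per-base ratio into gigabytes (1 GB = 1e9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


# Toy data: the first two reads overlap and share the minimizer "ACGT".
reads = ["GGACGTTG", "GACGTTGC", "TTTTCCCC"]
buckets = bucket_reads(reads, k=4)

# Size implied by the abstract's figures: 134.0 Gbp at 0.317 bits per base.
size_gb = compressed_size_gb(134.0e9, 0.317)
```

In this toy run the two overlapping reads fall into one bucket while the unrelated read gets its own, and the size helper gives roughly 5.31 GB for the 134.0 Gbp dataset, consistent with the figure stated in the abstract.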

Cite

Grabowski, S., Deorowicz, S., & Roguski, L. (2015). Disk-based compression of data from genome sequencing. Bioinformatics, 31(9), 1389–1395. https://doi.org/10.1093/bioinformatics/btu844
