Continued growth of generated sequencing data demands novel, scalable approaches to its storage and transmission. It is also crucial that analyses can be run on data in its compressed form without having to fully reconstruct it. We propose a novel approach to compression of sequence alignment data, a well-established data format that is used for a variety of tasks ranging from genome assembly to variant calling. Such alignment files may exceed the size of the original sequence by an order of magnitude. However, Referee, our tool implementing the approach, is able to compress alignment files to 1/10 of the original SAM file size and is twice as efficient as SAM's binary BAM variant. Referee is fast, highly parallelizable, and outperforms state-of-the-art tools by an average of 8.1% while enabling a variety of sequence-related tasks that require only partial decompression. Computations such as depth of sequencing, which involve seeking through all alignments, take 8 to 44 seconds with Referee, as opposed to tens of minutes with samtools. Referee uses a lightweight streaming clustering algorithm to improve the compression of quality values and encodes sequence information very efficiently, with compression rates as low as 0.06 bits per base. Its modular structure allows one to omit extraneous alignment information from the download, reducing sequencing data from many gigabytes to under a hundred megabytes.
Filippova, D., & Kingsford, C. (2015). Rapid, separable compression enables fast analyses of sequence alignments. In BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 194–201). Association for Computing Machinery, Inc. https://doi.org/10.1145/2808719.2808739