ERGC: An efficient referential genome compression algorithm

Subrata Saha; Sanguthevar Rajasekaran

Journal Article, Open Access

Bioinformatics (2015) 31(21) 3468-3475

DOI: 10.1093/bioinformatics/btv399

Abstract

Motivation: Genome sequencing has become faster and more affordable, so the number of available complete genomic sequences is increasing rapidly. As a result, the cost of storing, processing, analyzing and transmitting the data is becoming a bottleneck for research and future medical applications, and the need for efficient compression and data reduction techniques for biological sequencing data is growing. Although a number of standard data compression algorithms exist, they are not efficient at compressing biological data, because these generic algorithms do not exploit inherent properties of sequencing data. Exploiting the statistical and information-theoretic properties of genomic sequences requires specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems, known as reference-based genome compression.

Results: We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm achieves better compression ratios than the best known algorithms for this problem. The time to compress and decompress the whole genome is also very promising.

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/~rajasek/ERGC.zip.
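The abstract names reference-based genome compression but does not describe how ERGC works, so the following is only a toy illustration of the general idea, not the authors' algorithm. It assumes a simple greedy scheme: the target sequence is encoded as copies from a shared reference plus literal bases where the two differ. The function names, the op format and the k-mer length `k` are all hypothetical choices made for this sketch.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical op format for this sketch (not ERGC's on-disk format):
#   ("M", ref_pos, length)  copy `length` bases from the reference at ref_pos
#   ("L", text)             literal bases that were not matched in the reference
Op = Union[Tuple[str, int, int], Tuple[str, str]]


def build_index(reference: str, k: int) -> Dict[str, int]:
    """Map each k-mer to the first position where it occurs in the reference."""
    index: Dict[str, int] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)
    return index


def compress(reference: str, target: str, k: int = 4) -> List[Op]:
    """Greedily encode `target` as reference copies plus literals."""
    index = build_index(reference, k)
    ops: List[Op] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        # Look up the k-mer starting at i; too-short tails become literals.
        pos = index.get(target[i:i + k]) if i + k <= len(target) else None
        if pos is None:
            literal.append(target[i])
            i += 1
            continue
        # Extend the seed match for as long as both sequences agree.
        length = k
        while (i + length < len(target)
               and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            ops.append(("L", "".join(literal)))
            literal = []
        ops.append(("M", pos, length))
        i += length
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(reference: str, ops: List[Op]) -> str:
    """Rebuild the target from the reference and the op list."""
    parts: List[str] = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)
```

Because two genomes of the same species are largely identical, most of a target sequence collapses into a few long copy ops in a scheme like this, which is the redundancy that generic compressors without access to the reference cannot exploit. A practical tool would additionally entropy-code the op stream and handle details such as case, `N` runs and line breaks; none of that is shown here.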

Cite

Saha, S., & Rajasekaran, S. (2015). ERGC: An efficient referential genome compression algorithm. Bioinformatics, 31(21), 3468–3475. https://doi.org/10.1093/bioinformatics/btv399
