Abstract
Motivation: Bioinformatics databases grow rapidly and have reached sizes that were hard to imagine a decade ago. Among the many bioinformatics processes that generate hundreds of GB of data are multiple sequence alignments of protein families. The largest database of such alignments, Pfam, occupies 40-230 GB, depending on the variant. Storing and transferring such massive data has become a challenge.

Results: We propose a novel compression algorithm, CoMSA, designed especially for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, such as gzip. The experiments also yielded an analysis of how the size of a protein family influences the compression ratio.

Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa.

Supplementary material: Supplementary data are available at Bioinformatics online.
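The abstract describes CoMSA as based on a generalization of the positional Burrows-Wheeler transform (PBWT) to non-binary alphabets. The sketch below is a toy illustration of that general idea only, not CoMSA's actual code: the function names `pbwt_forward` and `pbwt_inverse` are hypothetical, and the stable sort on single symbols is an assumed way to handle an arbitrary alphabet. Each column of the alignment is emitted in the order given by the rows' preceding columns, so rows that agree just before a position end up next to each other.

```python
def pbwt_forward(rows):
    """Return the alignment's columns, each permuted by the PBWT row order.

    `rows` is a list of equal-length strings (one aligned sequence per row).
    Illustrative sketch only; not taken from the CoMSA sources.
    """
    order = list(range(len(rows)))
    width = len(rows[0]) if rows else 0
    out = []
    for k in range(width):
        # Read column k in the current row order.
        column = [rows[i][k] for i in order]
        out.append("".join(column))
        # Stable sort by the symbol just read: rows become ordered by their
        # reversed prefix ending at column k. Works for any alphabet size.
        order = [i for _, i in sorted(zip(column, order), key=lambda p: p[0])]
    return out


def pbwt_inverse(columns, n_rows):
    """Rebuild the original rows from the permuted columns."""
    order = list(range(n_rows))
    rows = [[] for _ in range(n_rows)]
    for column in columns:
        # The order at this column is known, so symbols go back to their rows.
        for symbol, i in zip(column, order):
            rows[i].append(symbol)
        # Replay the same stable sort the forward pass performed.
        order = [i for _, i in sorted(zip(column, order), key=lambda p: p[0])]
    return ["".join(r) for r in rows]


if __name__ == "__main__":
    alignment = ["AC-G", "AC-G", "TC-A", "ACTG"]
    transformed = pbwt_forward(alignment)
    print(transformed)
    print(pbwt_inverse(transformed, len(alignment)) == alignment)
```

The transform itself does not shrink the data; it only reorders symbols within each column so that similar contexts cluster, which a subsequent general-purpose or entropy coder may exploit.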
Deorowicz, S., Walczyszyn, J., & Debudaj-Grabysz, A. (2019). CoMSA: Compression of protein multiple sequence alignment files. Bioinformatics, 35(2), 227–234. https://doi.org/10.1093/bioinformatics/bty619