CoMSA: Compression of protein multiple sequence alignment files


Abstract

Motivation: Bioinformatics databases are growing rapidly and have reached sizes that were hard to imagine a decade ago. Among the many bioinformatics processes that generate hundreds of gigabytes of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, consumes 40-230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge.

Results: We propose a novel compression algorithm, CoMSA, designed especially for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles both FASTA and Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, such as gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio.

Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa.

Supplementary material: Supplementary data are available at Bioinformatics online.
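To make the idea of a positional Burrows-Wheeler transform over a non-binary alphabet more concrete, here is a minimal sketch in Python. It is an illustration of the general technique only, not CoMSA's implementation: the function names `pbwt_columns` and `inverse_pbwt_columns` and the toy alignment are assumptions made for this example. Each column is emitted in the order given by sorting the rows on their preceding symbols (most recent column first), so rows that agree on recent columns end up adjacent and the emitted columns tend to contain long runs of identical symbols.

```python
def pbwt_columns(rows):
    """Column-wise positional BWT for equal-length strings over any alphabet.

    Illustrative sketch only; not the CoMSA implementation.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned sequences must have the same length")

    order = list(range(len(rows)))  # current permutation of row indices
    out = []
    for col in range(width):
        # Emit the current column, reading rows in the current order.
        out.append("".join(rows[i][col] for i in order))
        # Stable sort by the current symbol: rows are now ordered by
        # their reversed prefix ending at this column.
        order = sorted(order, key=lambda i: rows[i][col])
    return out


def inverse_pbwt_columns(cols):
    """Rebuild the original rows from the transformed columns."""
    if not cols:
        return []
    n = len(cols[0])
    order = list(range(n))
    rows = [[] for _ in range(n)]
    for emitted in cols:
        # emitted[k] is the symbol of row order[k] in this column.
        for k, i in enumerate(order):
            rows[i].append(emitted[k])
        # Replay the same stable sort the forward transform performed.
        ranks = sorted(range(n), key=lambda k: emitted[k])
        order = [order[k] for k in ranks]
    return ["".join(r) for r in rows]


if __name__ == "__main__":
    # Hypothetical toy alignment; '-' marks a gap.
    alignment = ["MK-LV", "MKALV", "MR-LV", "MKALI"]
    transformed = pbwt_columns(alignment)
    assert inverse_pbwt_columns(transformed) == alignment
    print(transformed)
```

The transform itself does not shrink the data; it only reorders symbols so that a later modelling and entropy-coding stage can exploit the resulting runs. The stages a real compressor places after the transform are not shown here.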

Cite

Deorowicz, S., Walczyszyn, J., & Debudaj-Grabysz, A. (2019). CoMSA: Compression of protein multiple sequence alignment files. Bioinformatics, 35(2), 227–234. https://doi.org/10.1093/bioinformatics/bty619
