Compressing DNA sequence databases with coil

This article is free to access.

Abstract

Background: Publicly available DNA sequence databases such as GenBank are large and are growing at an exponential rate. The sheer volume of data presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression, an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil.

Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression: the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database.

Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.

© 2008 White and Hendy; licensee BioMed Central Ltd.
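The abstract does not spell out how edit-tree coding works, so the following is only a toy illustration of the general intuition behind edit-based coding: a sequence that closely resembles another can be stored as a short list of edits against it instead of in full. It is not coil's algorithm or file format. The function names `encode_edits` and `decode_edits`, the `("copy", ...)` / `("lit", ...)` operation layout, and the example sequences are all assumptions made for this sketch, which uses Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def encode_edits(parent: str, child: str) -> list:
    """Describe `child` as edits against `parent` (hypothetical toy format)."""
    ops = []
    matcher = SequenceMatcher(None, parent, child, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared stretch: store only the parent coordinates.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Bases that differ from the parent are stored literally.
            ops.append(("lit", child[j1:j2]))
        # "delete": nothing is stored, the parent bases are simply skipped.
    return ops


def decode_edits(parent: str, ops: list) -> str:
    """Rebuild the child sequence from the parent and the edit list."""
    pieces = []
    for op in ops:
        if op[0] == "copy":
            pieces.append(parent[op[1]:op[2]])
        else:
            pieces.append(op[1])
    return "".join(pieces)


if __name__ == "__main__":
    # Made-up example sequences, not real EST data.
    parent = "ACGTACGTTAGCCGATAGGCTTAACGGATCC"
    child = "ACGTACGTTAGCAGATAGGCTTAACGGATCCTT"
    ops = encode_edits(parent, child)
    print(ops)
    print(decode_edits(parent, ops) == child)
```

Decoding here is a single pass of slicing and concatenation, which is in the same spirit as the abstract's remark that decompression is cheap compared with compression; choosing which sequence should serve as the parent of which is the expensive part that this sketch leaves out entirely.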

Cite

White, W. T. J., & Hendy, M. D. (2008). Compressing DNA sequence databases with coil. BMC Bioinformatics, 9. https://doi.org/10.1186/1471-2105-9-242
