The application of data compression-based distances to biological sequences


Abstract

Text compression algorithms can be used to construct metric distance measures, known as compression-based distances (CBDs), that are suitable for character sequences. Here we review the principles of various types of compression algorithms and describe their general behaviour when comparing protein and DNA sequences. We employ reduced and enlarged alphabets, and model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring methods with respect to domain shuffling, and in some cases performed better when the alphabet was reduced. © 2009 Springer US.
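
To make the idea of a compression-based distance concrete, here is a minimal sketch in Python. It assumes the widely used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes, and it assumes zlib as the compressor. The abstract does not say which formula or compressors the chapter uses, so the names `compressed_size` and `ncd`, the synthetic sequences, and the "shuffled" rearrangement are illustrative assumptions, not the authors' method.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (an assumed form of CBD).

    Values near 0 suggest the sequences share most of their content;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic DNA-like sequences (hypothetical data, fixed seed for repeatability).
rng = random.Random(0)
a = "".join(rng.choices("ACGT", k=2000)).encode()
b = "".join(rng.choices("ACGT", k=2000)).encode()

# Toy stand-in for domain shuffling: swap the two halves of sequence a.
shuffled = a[1000:] + a[:1000]

print("identical :", round(ncd(a, a), 3))
print("shuffled  :", round(ncd(a, shuffled), 3))
print("unrelated :", round(ncd(a, b), 3))
```

In this toy setting the rearranged sequence should stay close to the original, because the compressor can still reuse both halves regardless of their order. That is the intuition behind the robustness of CBDs to domain shuffling reported above; real results depend on the compressor and on sequence length.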

Kertesz-Farkas, A., Kocsor, A., & Pongor, S. (2009). The application of data compression-based distances to biological sequences. In Information Theory and Statistical Learning (pp. 83–100). Springer US. https://doi.org/10.1007/978-0-387-84816-7_4