The application of data compression-based distances to biological sequences

Attila Kertesz-Farkas; Andras Kocsor; Sandor Pongor

Book Chapter


Springer US, (2009), 83-100

DOI: 10.1007/978-0-387-84816-7_4


Abstract

Text compression algorithms can be used to construct metric distance measures, termed compression-based distances (CBDs), suitable for character sequences. Here we review the principles of various types of compression algorithms and describe their general behaviour with respect to the comparison of protein and DNA sequences. We employ reduced and enlarged alphabets, and model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring methods with respect to domain shuffling, and in some cases showed increased performance when the alphabet was reduced. © 2009 Springer US.
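To make the idea of a compression-based distance concrete, the sketch below computes the normalized compression distance (NCD), one widely used form of CBD. The abstract does not state which formulas or compressors the chapter uses, so both the NCD formula and the choice of `zlib` here are assumptions made for illustration; the function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Assumption: zlib stands in
    for whichever compressors the chapter actually evaluates.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Calling `ncd(seq_a, seq_b)` on two DNA or protein strings should give a value near 0 when the sequences share most of their content and a value near 1 when they share little. Because real compressors only approximate the ideal, the result can fall slightly outside that range, and `zlib` only finds repeats within its 32 KB window, so very long sequences would need a different compressor.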


Kertesz-Farkas, A., Kocsor, A., & Pongor, S. (2009). The application of data compression-based distances to biological sequences. In Information Theory and Statistical Learning (pp. 83–100). Springer US. https://doi.org/10.1007/978-0-387-84816-7_4
