Text compression algorithms can be used to construct metric distance measures, called compression-based distances (CBDs), that are suitable for character sequences. Here we review the principles behind various types of compression algorithms and describe their general behaviour when used to compare protein and DNA sequences. We employ reduced and enlarged alphabets, and model biological rearrangements such as domain shuffling. In classification experiments evaluated with ROC analysis, CBDs perform less well than substring-based methods such as BLAST and the Smith-Waterman algorithm, but better than distances based on word composition. CBDs outperformed substring-based methods with respect to domain shuffling, and in some cases performed better when the alphabet was reduced. © 2009 Springer US.
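To make the idea of a compression-based distance concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation. The abstract does not say which distance formula or compressor was used, so the sketch assumes the widely used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib as the compressor C. The function names and the two toy sequences are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor C)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two character sequences.

    Assumption: the NCD formula stands in for a generic CBD; the source
    abstract does not specify which formula the authors evaluated.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy amino-acid strings made up for illustration, not real proteins.
seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
seq_b = "GSHMWDPNCRYFTIEGLKHVQAWNDPYCMTRFGIHEKLSWQNYDCPMVTGAFRHIELKWYNQDSCPTMGV"

print(f"NCD(a, a) = {ncd(seq_a, seq_a):.3f}")
print(f"NCD(a, b) = {ncd(seq_a, seq_b):.3f}")
```

A sequence compared with itself should score much lower than a comparison between two unrelated sequences, because the compressor can encode the second copy as a back-reference to the first. With a real compressor the value depends slightly on concatenation order, so the measure is only approximately symmetric.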
Kertesz-Farkas, A., Kocsor, A., & Pongor, S. (2009). The application of data compression-based distances to biological sequences. In Information Theory and Statistical Learning (pp. 83–100). Springer US. https://doi.org/10.1007/978-0-387-84816-7_4