Sequence data are widely used to gain deeper insight into biological systems. From a data analysis perspective, they are given as a set of symbol sequences of varying length and are generally compared using nonmetric score functions. In this form the data are nonstandard: they do not provide an immediate metric vector space, which complicates their analysis with standard methods. In this chapter we provide various strategies for analyzing this type of data in a mathematically accurate way, instead of the ad hoc solutions often seen. Our approach is based on the scoring values from protein sequence data, although it is applicable in a broader sense. We discuss potential recoding concepts for the scores and algorithms that solve clustering, classification, and embedding tasks for score data in a protein sequence application.
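To make the idea of recoding nonmetric scores concrete, the following minimal sketch shows one common option, assuming the scores are collected in a symmetric similarity matrix. It clips negative eigenvalues to obtain a positive semidefinite matrix and then derives squared distances from it. Eigenvalue clipping is used here only as an illustration and is not necessarily the recoding the chapter recommends; the function names and the toy score values are hypothetical.

```python
import numpy as np


def clip_spectrum(S):
    """Zero out negative eigenvalues of a symmetric score matrix.

    Illustrative recoding only (an assumption, not taken from the chapter):
    the result is positive semidefinite and can be used as a kernel matrix.
    """
    S_sym = 0.5 * (S + S.T)                  # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S_sym)
    clipped = np.clip(eigvals, 0.0, None)    # drop the negative part of the spectrum
    return (eigvecs * clipped) @ eigvecs.T   # V diag(clipped) V^T


def similarity_to_distance(K):
    """Squared distances induced by a PSD similarity matrix K:
    D2[i, j] = K[i, i] + K[j, j] - 2 * K[i, j]."""
    d = np.diag(K)
    D2 = d[:, None] + d[None, :] - 2.0 * K
    return np.maximum(D2, 0.0)               # guard against tiny negative round-off


# Made-up pairwise scores for three hypothetical sequences (indefinite matrix).
S = np.array([[ 5.0, 4.0, -3.0],
              [ 4.0, 5.0,  4.0],
              [-3.0, 4.0,  5.0]])

K = clip_spectrum(S)
D2 = similarity_to_distance(K)
```

After this step, `K` can be passed to kernel-based methods and `D2` to distance-based clustering or embedding methods that expect metric input.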
Schleif, F. M. (2016). Protein sequence analysis by proximities. In Methods in Molecular Biology (Vol. 1362, pp. 185–195). Humana Press Inc. https://doi.org/10.1007/978-1-4939-3106-4_12