Classifying biological sequences is one of the most important tasks in computational biology. In the last decade, support vector machines (SVMs) in combination with sequence kernels have emerged as a de-facto standard. These methods are theoretically well-founded, reliable, and provide high-accuracy solutions at low computational cost. However, obtaining a highly accurate classifier is rarely the end of the story in many practical situations. Instead, one often aims to acquire biological knowledge about the principles underlying a given classification task. SVMs with traditional sequence kernels do not offer a straightforward way of accessing this knowledge.In this contribution, we propose a new approach to analyzing biological sequences on the basis of support vector machines with sequence kernels. We first extract explicit pattern weights from a given SVM. When classifying a sequence, we then compute a prediction profile by distributing the weight of each pattern to the sequence positions that match the pattern. The final profile not only allows assessing the importance of a position, but also determining for which class it is indicative. Since it is unfeasible to analyze profiles of all sequences in a given data set, we advocate using affinity propagation (AP) clustering to narrow down the analysis to a small set of typical sequences.The proposed approach is applicable to a wide range of biological sequences and a wide selection of sequence kernels. To illustrate our framework, we present the prediction of oligomerization tendencies of coiled coil proteins as a case study.
CITATION STYLE
Bodenhofer, U., Kothmeier, A., Abfalter, I., Mahrenholz, C., & Hochreiter, S. (2010). Decoding Sequence Classification Models for Acquiring New Biological Insights. Nature Precedings. https://doi.org/10.1038/npre.2010.4708.1
Mendeley helps you to discover research relevant for your work.