Abstract
String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on tasks such as document topic elucidation, music genre classification, protein superfamily and fold prediction. However, typical string kernel methods rely on symbolic Hamming-distance based matching, which may not necessarily reflect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·; ·) and distance-preserving embeddings with string kernels to improve sequence classification. In particular, we consider two approaches that allow one either to incorporate non-Hamming similarity S(·; ·) into similarity evaluation by matching only the features that are similar according to S(·; ·) or to retain actual (approximate) similarity/distance scores in similarity evaluation. An embedding step, a distance-preserving bit-string mapping, is used to effectively capture similarity between otherwise symbolically different sequence elements. We show that it is possible to retain computational efficiency of string kernels while using this more "precise" measure of similarity. We then demonstrate that on a number of sequence classification tasks such as music and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines. Copyright © 2012 by the Society for Industrial and Applied Mathematics.
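As a rough illustration of the contrast drawn in the abstract, the toy sketch below compares exact symbolic k-mer matching with matching on bit-string embeddings. It is not the authors' algorithm: the alphabet, the `codes` table, the threshold `max_bits`, and all function names are hypothetical, and the pairwise loop is a naive quadratic comparison rather than the efficient evaluation the paper refers to. The thresholded count is a similarity score only and is not claimed to be a positive semidefinite kernel.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return all contiguous length-k fragments of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def spectrum_kernel(x, y, k):
    """Exact-match (symbolic) spectrum kernel: sum of products of shared k-mer counts."""
    cx, cy = Counter(kmers(x, k)), Counter(kmers(y, k))
    return sum(cx[a] * cy[a] for a in cx if a in cy)


def embed(fragment, codes):
    """Concatenate the per-symbol bit codes of a fragment (hypothetical embedding)."""
    return "".join(codes[s] for s in fragment)


def hamming(a, b):
    """Number of positions at which two equal-length bit-strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def similar_match_count(x, y, k, codes, max_bits):
    """Count k-mer pairs whose embedded bit-strings differ in at most max_bits bits."""
    ex = [embed(f, codes) for f in kmers(x, k)]
    ey = [embed(f, codes) for f in kmers(y, k)]
    return sum(1 for a, b in product(ex, ey) if hamming(a, b) <= max_bits)


# Assumed toy codes: A and B are "physically" close, as are C and D.
codes = {"A": "000", "B": "001", "C": "110", "D": "111"}
x, y = "AACD", "ABCD"

exact = spectrum_kernel(x, y, 2)                 # only "CD" matches symbolically
relaxed = similar_match_count(x, y, 2, codes, 1)  # also pairs AA/AB and AC/BC
strict = similar_match_count(x, y, 2, codes, 0)   # zero tolerance: exact matching
print(exact, relaxed, strict)
```

With `max_bits=0` and distinct codes per symbol, the embedded comparison reduces to exact k-mer matching, so the first and last scores agree; raising the threshold lets symbolically different but similar fragments contribute.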
Kuksa, P. P., Khan, I., & Pavlovic, V. (2012). Generalized similarity kernels for efficient sequence classification. In Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012 (pp. 873–882). https://doi.org/10.1137/1.9781611972825.75