Unsupervised dimension reduction methods for protein sequence classification

Abstract

Feature extraction methods are widely applied to reduce the dimensionality of data before classification, which lowers the risk of fitting noise. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear and non-parametric methods for dimension reduction, such as Isomap, Stochastic Neighbor Embedding (SNE) and Interpol, are also used. In this study, we compare the performance of PCA, Isomap, t-SNE and Interpol as preprocessing steps for the classification of protein sequences. Using random forests, we evaluated classification performance on two artificial and eighteen real-world protein data sets, covering HIV drug resistance, HIV-1 co-receptor usage and protein functional class prediction, after preprocessing with each of the four methods. Significant differences between these feature extraction methods were observed. The prediction performance of Interpol converges towards a stable value that is significantly higher than that of PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, in which amino acids often depend on and affect one another, for instance to achieve conformational stability. However, visualization of data reduced with Interpol is rather unintuitive compared with the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification, but is of limited use for visualization.
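
The abstract does not describe how Interpol works. As a rough illustration only, the sketch below assumes that Interpol-style preprocessing encodes each amino acid with a numeric descriptor and then linearly interpolates the resulting profile to a fixed number of points, so that sequences of different lengths yield feature vectors of equal length. The choice of the Kyte-Doolittle hydropathy scale as descriptor, the function names and the default of 10 points are all assumptions made for this example, not details taken from the paper.

```python
# Kyte-Doolittle hydropathy values, used here as one possible numeric
# descriptor (an assumption; the paper may use other descriptors).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue of a protein sequence to its descriptor value."""
    return [HYDROPATHY[aa] for aa in sequence.upper()]


def interpolate_to_length(values, n_points):
    """Linearly resample a numeric profile to exactly n_points values."""
    if n_points < 2 or len(values) < 2:
        raise ValueError("need at least two input values and two output points")
    step = (len(values) - 1) / (n_points - 1)
    resampled = []
    for i in range(n_points):
        pos = i * step
        # Index of the left neighbour, clamped so left + 1 stays in range.
        left = min(int(pos), len(values) - 2)
        frac = pos - left
        resampled.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return resampled


def fixed_length_features(sequence, n_points=10):
    """Hypothetical helper: sequence -> fixed-length feature vector."""
    return interpolate_to_length(encode(sequence), n_points)
```

For instance, `fixed_length_features("MKVLAAGIV", 5)` and `fixed_length_features("MKV", 5)` both return five values, which could then be passed to a classifier such as a random forest.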

Cite

Heider, D., Bartenhagen, C., Dybowski, J. N., Hauke, S., Pyka, M., & Hoffmann, D. (2014). Unsupervised dimension reduction methods for protein sequence classification. In Studies in Classification, Data Analysis, and Knowledge Organization (Vol. 47, pp. 295–302). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-319-01595-8_32
