Protein sequences classification by means of feature extraction with substitution matrices

53Citations
Citations of this article
73Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Background: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been largely used to encode biological sequences into feature vectors to enable using well-known machine-learning classifiers which require this format. However, designing a suitable feature space, for a set of proteins, is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.Results: In order to demonstrate the efficiency of such approach, we compare several encoding methods using some machine learning classifiers. The experimental results showed that our encoding method outperforms other ones in terms of classification accuracy and number of generated attributes. We also compared the classifiers in term of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of the substitution matrices variation on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variances of the obtained accuracies using various substitution matrices are slight. However, the number of generated features varies from a substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works.Conclusions: The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks. © 2010 Saidi et al; licensee BioMed Central Ltd.

Cite

CITATION STYLE

APA

Saidi, R., Maddouri, M., & Mephu Nguifo, E. (2010). Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinformatics, 11. https://doi.org/10.1186/1471-2105-11-175

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free