Efficient use of unlabeled data for protein sequence classification: A comparative study

Pavel Kuksa; Pai Hsi Huang; Vladimir Pavlovic

Conference Proceedings, Open Access

BMC Bioinformatics (2009) 10(SUPPL. 4)

DOI: 10.1186/1471-2105-10-S4-S2

Abstract

Background: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels and trained on sequences known to belong to particular folds or superfamilies (the labeled data set) can attain significantly improved accuracy when this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that exploits the unlabeled data more effectively by using only the sequence regions that are more likely to be biologically relevant, which improves prediction accuracy. Because over-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers.

Results: Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits a significant reduction in running time.

Conclusion: The unlabeled sequences used in the semi-supervised setting resemble unpolished gemstones: used as-is, they may carry unnecessary features and hence compromise classification accuracy, but once cut and polished, they improve the accuracy of the classifiers considerably.

© 2009 Kuksa et al; licensee BioMed Central Ltd.

Cite

Kuksa, P., Huang, P. H., & Pavlovic, V. (2009). Efficient use of unlabeled data for protein sequence classification: A comparative study. In BMC Bioinformatics (Vol. 10). https://doi.org/10.1186/1471-2105-10-S4-S2
