Biological sequence classification utilizing positive and unlabeled data


Abstract

Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.

Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.

© The Author 2008. Published by Oxford University Press. All rights reserved.

Cite

Xiao, Y., & Segal, M. R. (2008). Biological sequence classification utilizing positive and unlabeled data. Bioinformatics, 24(9), 1198–1205. https://doi.org/10.1093/bioinformatics/btn089
