Motivation: In genomics, an increasingly common data configuration consists of a small set of sequences known to possess a targeted property (positive instances) among a large set of sequences whose class membership is unknown (unlabeled instances). Traditional two-class classification methods do not handle such data effectively.

Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and used in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies, prediction of HLA binding and alternative splicing conservation between human and mouse, we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.

© The Author 2008. Published by Oxford University Press. All rights reserved.
CITATION STYLE
Xiao, Y., & Segal, M. R. (2008). Biological sequence classification utilizing positive and unlabeled data. Bioinformatics, 24(9), 1198–1205. https://doi.org/10.1093/bioinformatics/btn089