Scalability of continuous active learning for reliable high-recall text classification

Abstract

For finite document collections, continuous active learning ("CAL") has been observed to achieve high recall with high probability, at a labeling cost asymptotically proportional to the number of relevant documents. As the size of the collection increases, the number of relevant documents typically increases as well, thereby limiting the applicability of CAL to low-prevalence high-stakes classes, such as evidence in legal proceedings, or security threats, where human effort proportional to the number of relevant documents is justified. We present a scalable version of CAL ("S-CAL") that requires O(log N) labeling effort and O(N log N) computational effort (where N is the number of unlabeled training examples) to construct a classifier whose effectiveness for a given labeling cost compares favorably with previously reported methods. At the same time, S-CAL offers calibrated estimates of class prevalence, recall, and precision, facilitating both threshold setting and determination of the adequacy of the classifier.
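
The scaling argument in the abstract can be illustrated with a rough sketch that compares labeling effort proportional to the number of relevant documents against effort that grows as O(log N). The function names and the constants (1% prevalence, an overhead factor of 1, 100 labels per doubling of N) are assumptions chosen for illustration, not values from the paper, and the functions model only the asymptotic growth rates, not the S-CAL algorithm itself.

```python
import math


def cal_labeling_effort(n_docs, prevalence, overhead=1.0):
    """Effort proportional to the number of relevant documents (prevalence * N).

    `overhead` is a hypothetical constant of proportionality.
    """
    return overhead * prevalence * n_docs


def scal_labeling_effort(n_docs, labels_per_doubling=100.0):
    """Effort growing as O(log N).

    `labels_per_doubling` is a hypothetical constant: each doubling of the
    collection size adds this many labels.
    """
    return labels_per_doubling * math.log2(n_docs)


def compare(sizes, prevalence=0.01):
    """Return (N, proportional effort, logarithmic effort) for each size."""
    return [
        (n, cal_labeling_effort(n, prevalence), scal_labeling_effort(n))
        for n in sizes
    ]


if __name__ == "__main__":
    for n, cal, scal in compare([10**4, 10**6, 10**8]):
        print(f"N={n:>11,}  proportional={cal:>12,.0f}  logarithmic={scal:>8,.0f}")
```

Under these assumed constants, the proportional cost grows a hundredfold each time N grows a hundredfold, while the logarithmic cost increases only by a fixed additive amount, which is the gap the abstract attributes to S-CAL.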

Cite

Cormack, G. V., & Grossman, M. R. (2016). Scalability of continuous active learning for reliable high-recall text classification. In International Conference on Information and Knowledge Management, Proceedings (Vol. 24-28-October-2016, pp. 1039–1048). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983776
