For finite document collections, continuous active learning ("CAL") has been observed to achieve high recall with high probability, at a labeling cost asymptotically proportional to the number of relevant documents. As the size of the collection increases, the number of relevant documents typically increases as well, limiting the applicability of CAL to low-prevalence, high-stakes classes, such as evidence in legal proceedings or security threats, where human effort proportional to the number of relevant documents is justified. We present a scalable version of CAL ("S-CAL") that requires O(log N) labeling effort and O(N log N) computational effort, where N is the number of unlabeled training examples, to construct a classifier whose effectiveness for a given labeling cost compares favorably with previously reported methods. At the same time, S-CAL offers calibrated estimates of class prevalence, recall, and precision, facilitating both threshold setting and determination of the adequacy of the classifier.
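The abstract does not spell out the S-CAL procedure, but the stated O(log N) labeling and O(N log N) computational bounds suggest a CAL-style loop in which review batches grow geometrically while only a capped subsample of each batch is sent to the human reviewer. The Python sketch below illustrates that idea under those assumptions only; the helper names label_fn, score_fn, and train_fn, the growth factor, and the per-batch label cap are hypothetical placeholders rather than details taken from the paper, and the calibrated prevalence, recall, and precision estimates mentioned in the abstract are not shown.

    import math
    import random

    def scal_style_loop(unlabeled_docs, label_fn, score_fn, train_fn,
                        initial_batch=1, growth=2, labels_per_batch=2):
        """Illustrative scalable-CAL-style loop (a sketch, not the authors' code).

        label_fn(doc) -> 0/1 human judgment (hypothetical reviewer oracle)
        score_fn(model, doc) -> relevance score under the current model
        train_fn(labeled_pairs) -> model fit to the (doc, judgment) pairs so far
        """
        remaining = list(unlabeled_docs)
        labeled = []          # (doc, judgment) pairs actually reviewed
        model = None
        batch_size = initial_batch

        while remaining:
            # Rank the remaining documents by the current model;
            # before any labels exist, fall back to a random order.
            if model is not None:
                remaining.sort(key=lambda d: score_fn(model, d), reverse=True)
            else:
                random.shuffle(remaining)

            # Take the top batch_size documents, but have the reviewer label
            # only a capped subsample; the rest of the batch is set aside unlabeled.
            batch, remaining = remaining[:batch_size], remaining[batch_size:]
            subsample = random.sample(batch, min(labels_per_batch, len(batch)))
            labeled.extend((doc, label_fn(doc)) for doc in subsample)

            # Retrain on everything labeled so far, then grow the batch geometrically.
            model = train_fn(labeled)
            batch_size = math.ceil(batch_size * growth)

        return model, labeled

Under these assumptions the loop runs for O(log N) iterations because the batch size grows geometrically; each iteration labels at most a constant number of documents and evaluates the model on at most N documents, which is one way to arrive at the O(log N) labeling and O(N log N) scoring figures quoted above (the bookkeeping here uses a full sort per pass purely for simplicity).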
Citation: Cormack, G. V., & Grossman, M. R. (2016). Scalability of continuous active learning for reliable high-recall text classification. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2016), pp. 1039–1048. Association for Computing Machinery. https://doi.org/10.1145/2983323.2983776