Asterisk

  • Nashaat M
  • Ghosh A
  • Miller J
  • et al.
N/ACitations
Citations of this article
9Readers
Mendeley users who have this article in their library.

Abstract

Labeling datasets is one of the most expensive bottlenecks in data preprocessing tasks in machine learning. Therefore, organizations, in many domains, are applying weak supervision to produce noisy labels. However, since weak supervision relies on cheaper sources, the quality of the generated labels is problematic. Therefore, in this article, we present Asterisk , an end-to-end framework to generate high-quality, large-scale labeled datasets. The system, first, automatically generates heuristics to assign initial labels. Then, the framework applies a novel data-driven active learning process to enhance the labeling quality. We present an algorithm that learns the selection policy by accommodating the modeled accuracies of the heuristics, along with the outcome of the generative model. Finally, the system employs the output of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework within a wide range of real-world applications. The experiments include 10 datasets of varying sizes with a maximum size of 11 million records. The results illustrate the effectiveness of the framework in producing high-quality labels and achieving high classification accuracy with minimal annotation efforts.

Cite

CITATION STYLE

APA

Nashaat, M., Ghosh, A., Miller, J., & Quader, S. (2020). Asterisk. ACM/IMS Transactions on Data Science, 1(2), 1–25. https://doi.org/10.1145/3385188

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free