TopGuNN: Fast NLP Training Data Augmentation using Large Corpora


Abstract

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that efficiently indexes and searches contextual embeddings generated from large corpora to retrieve new, diverse training examples. We demonstrate TopGuNN on a semantic role labeling training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63 TB (approximately 1.5B embeddings) in less than a day.
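
The abstract outlines the core pipeline: embed every corpus token with a contextual encoder, build a k-NN index over those token embeddings, and answer a query by nearest-neighbor search on the embedding of a query token in context. The following is a minimal sketch of that idea, assuming Hugging Face `transformers` for BERT embeddings and `faiss` for the index; the model choice, toy corpus, and helper names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of contextualized k-NN retrieval in the spirit of TopGuNN.
# Assumptions: BERT contextual embeddings via `transformers`, index via `faiss`.
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_tokens(sentence: str) -> tuple[list[str], np.ndarray]:
    """Return a sentence's word-piece tokens and their contextual embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, hidden.numpy()

# Index every token embedding from a toy corpus (special tokens included,
# for simplicity); TopGuNN does this at Gigaword scale with approximate k-NN.
corpus = [
    "The senator proposed a bill to reform the tax code.",
    "Engineers designed a bridge spanning the river.",
]
vectors, metadata = [], []  # metadata maps each index row back to (sentence, token)
for sent in corpus:
    tokens, embs = embed_tokens(sent)
    for tok, emb in zip(tokens, embs):
        vectors.append(emb)
        metadata.append((sent, tok))

xb = np.stack(vectors).astype("float32")
faiss.normalize_L2(xb)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(xb.shape[1])  # exact here; approximate indexes at scale
index.add(xb)

# Query: which corpus tokens are used most similarly to "wrote" in this context?
q_tokens, q_embs = embed_tokens("The committee wrote new legislation.")
q = q_embs[q_tokens.index("wrote")].reshape(1, -1).astype("float32")
faiss.normalize_L2(q)
scores, ids = index.search(q, 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {metadata[i][1]!r} in: {metadata[i][0]}")
```

At Gigaword scale, the flat index above would be swapped for an approximate structure (for example, one of FAISS's quantized inverted-file indexes) so that queries over roughly 1.5B embeddings remain tractable, consistent with the sub-day query times the abstract reports.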

Cite

APA

Iglesias-Flores, R., Mishra, M., Patel, A., Malhotra, A., Kriz, R., Palmer, M., & Callison-Burch, C. (2021). TopGuNN: Fast NLP training data augmentation using large corpora. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances (DaSH-LA 2021) (pp. 86–101). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.dash-1.14
