Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora to easily retrieve new, diverse training examples. TopGuNN is demonstrated on a semantic role labeling training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
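The core retrieval step the abstract describes is nearest-neighbor search over contextual word embeddings. Below is a minimal NumPy sketch of exact cosine-similarity k-NN over an embedding matrix; it is an illustration only, not the paper's implementation, which uses an approximate k-NN index to scale to ~1.5B embeddings. The function name `knn_search` and the toy data are assumptions for the example.

```python
import numpy as np

def knn_search(index_embeddings, query, k=5):
    """Return indices and cosine similarities of the k nearest embeddings."""
    # Normalize rows so a dot product equals cosine similarity.
    index_norm = index_embeddings / np.linalg.norm(
        index_embeddings, axis=1, keepdims=True
    )
    q = query / np.linalg.norm(query)
    sims = index_norm @ q
    top = np.argsort(-sims)[:k]  # indices of the k highest-similarity rows
    return top, sims[top]

# Toy corpus: 10,000 random 768-dim vectors (BERT-sized embeddings).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 768)).astype(np.float32)

# A query that is a slightly perturbed copy of corpus vector 42,
# standing in for a contextual embedding of a query word.
query = corpus[42] + 0.01 * rng.standard_normal(768).astype(np.float32)

idx, scores = knn_search(corpus, query, k=3)
print(idx[0])  # nearest neighbor is vector 42
```

At corpus scale, the exact `argsort` over all similarities becomes the bottleneck, which is why systems in this setting swap in an approximate index (e.g., a FAISS-style inverted-file or graph index) that trades a small amount of recall for orders-of-magnitude faster queries.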
CITATION STYLE
Iglesias-Flores, R., Mishra, M., Patel, A., Malhotra, A., Kriz, R., Palmer, M., & Callison-Burch, C. (2021). TopGuNN: Fast NLP Training Data Augmentation using Large Corpora. In DaSH-LA 2021 - 2nd Workshop on Data Science with Human-in-the-Loop: Language Advances, Proceedings (pp. 86–101). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.dash-1.14