Does Putting a Linguist in the Loop Improve NLU Data Collection?


Abstract

Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work 'in the loop' during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to the baseline, and that adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.

Citation

Parrish, A., Huang, W., Agha, O., Lee, S. H., Nangia, N., Warstadt, A., … Bowman, S. R. (2021). Does Putting a Linguist in the Loop Improve NLU Data Collection? In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 4886–4901). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.421
