Bigger isn't better: The ethical and scientific vices of extra-large datasets in language models


Abstract

The use of language models in Web applications and other areas of computing and business has grown significantly over the last five years. One reason for this growth is the improvement in the performance of language models on a number of benchmarks, but a side effect of these advances has been the adoption of a "bigger is always better" paradigm when it comes to the size of training, testing, and challenge datasets. Drawing on previous criticisms of this paradigm as applied to large training datasets crawled from pre-existing text on the Web, we extend the critique to challenge datasets custom-created by crowdworkers. We present several sets of criticisms in which ethical and scientific issues in language model research reinforce each other: labour injustices in crowdwork, dataset quality and inscrutability, inequities in the research community, and centralized corporate control of the technology. We also present a new type of tool for researchers to use in examining large datasets when evaluating them for quality.

Cite

APA

Goetze, T. S., & Abramson, D. (2021). Bigger isn't better: The ethical and scientific vices of extra-large datasets in language models. In ACM International Conference Proceeding Series (pp. 69–75). Association for Computing Machinery. https://doi.org/10.1145/3462741.3466809
