A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl

5 citations · 35 Mendeley readers

Abstract

Common Crawl is the largest freely available collection of web crawl data and one of the most important sources of pre-training data for large language models (LLMs). It is used so frequently, and in many cases makes up such a large proportion of the overall pre-training data, that it has arguably become a foundational building block for LLM development, and consequently for the generative AI products built on top of LLMs. Despite this pivotal role, Common Crawl itself is not widely understood, nor is there much evident reflection among LLM builders about the implications of using its data. This paper discusses what Common Crawl's popularity for LLM development means for fairness, accountability, and transparency in generative AI by highlighting the organization's values and practices, as well as how it views its own role within the AI ecosystem. Our qualitative analysis is based on in-depth interviews with Common Crawl staffers and on relevant online documents. After discussing Common Crawl's role in generative AI and how LLM builders have typically used its data for pre-training, we review Common Crawl's self-defined values and priorities and highlight the limitations and biases of its crawling process. We find that Common Crawl's popularity has in many ways made generative AI more open to scrutiny, and that it has enabled more LLM research and development to take place beyond the well-resourced leading AI companies. At the same time, many LLM builders have used Common Crawl as a source of training data in problematic ways: for instance, with little care for or transparency about how its massive crawl data was filtered for harmful content before pre-training, often relying on rudimentary automated filtering techniques. We offer recommendations for Common Crawl and for LLM builders on how to improve fairness, accountability, and transparency in LLM research and development.
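To make concrete what "rudimentary automated filtering" can look like in practice, below is a minimal, hypothetical sketch of a keyword-blocklist filter, in the spirit of the word-list filtering applied to Common Crawl derivatives such as C4. The blocklist contents, document format, and function names are illustrative assumptions, not a description of Common Crawl's or any particular LLM builder's pipeline.

# A minimal sketch of blocklist-based document filtering (assumed, illustrative).
# Crudeness is the point: one matching token discards an entire document,
# with no regard for context, language, or topic.

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms; real lists are far longer

def is_clean(document: str) -> bool:
    """Return True if the document contains no blocklisted token."""
    tokens = set(document.lower().split())
    return tokens.isdisjoint(BLOCKLIST)

corpus = [
    "an innocuous page about gardening",
    "a page containing badword1",
]
filtered = [doc for doc in corpus if is_clean(doc)]
print(filtered)  # only the first document survives

This kind of context-blind matching is cheap at web scale, which is why it is common, but it both misses harmful content phrased without blocklisted tokens and disproportionately removes benign text (e.g., medical or minority-community pages) that happens to contain flagged words.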

Citation (APA)
Baack, S. (2024). A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. In 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024 (pp. 2199–2208). Association for Computing Machinery, Inc. https://doi.org/10.1145/3630106.3659033
