Detecting Personal Information in Training Corpora: an Analysis

Nishant Subramani; Alexandra Sasha Luccioni; Jesse Dodge; Margaret Mitchell

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023) 208-220

DOI: 10.18653/v1/2023.trustnlp-1.18

5 Citations

16 Readers

Abstract

Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively underexplored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.
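To make the regular expression-based side of that comparison concrete, the following is a minimal sketch of how email addresses and credit card numbers might be flagged in raw text. It is not the authors' implementation: the patterns EMAIL_RE and CARD_RE, the helper luhn_valid, and the function find_pi are hypothetical names with deliberately simplified patterns, and the Luhn checksum is added here only as a common way to discard digit runs that cannot be valid card numbers.

```python
import re

# Simplified, illustrative patterns (assumptions, not the paper's expressions).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pi(text: str) -> dict:
    """Collect candidate email addresses and Luhn-valid card numbers."""
    emails = EMAIL_RE.findall(text)
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # strip separators
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            cards.append(digits)
    return {"email": emails, "credit_card": cards}


if __name__ == "__main__":
    sample = (
        "Contact jane.doe@example.com, card 4111 1111 1111 1111, "
        "order 1234 5678 9012 3456."
    )
    print(find_pi(sample))
```

In this sketch the second 16-digit run is matched by the pattern but rejected by the checksum, which illustrates why a pattern match alone tends to over-report: any long digit sequence looks like a card number until it is validated.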

Cite

Subramani, N., Luccioni, A. S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: an Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 208–220). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.trustnlp-1.18
