Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models remains relatively underexplored. In this work, we draw on definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals of personal information detection.
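To make the regular expression-based approach concrete, the sketch below shows how email addresses and credit card numbers might be matched in raw text. The patterns and the Luhn-checksum filter are illustrative assumptions, not the exact expressions used in the study; real-world regexes of this kind are known to produce both false positives and false negatives, which is part of the shortcoming the paper examines.

```python
import re

# Illustrative patterns only -- not the exact expressions used in the paper.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Runs of 13-16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, commonly used to filter credit-card false positives."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_pi(text: str) -> dict:
    """Return candidate high-risk PI spans found by the regexes."""
    emails = EMAIL_RE.findall(text)
    cards = [m for m in CARD_RE.findall(text) if luhn_valid(m)]
    return {"emails": emails, "credit_cards": cards}
```

For example, `find_pi("Email alice@example.com, card 4111 1111 1111 1111.")` flags one email and one (Luhn-valid) card number, while a phone number of similar length would be rejected by the checksum. Note that digit runs such as order IDs can still slip through, illustrating why purely pattern-based detection is brittle.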
Subramani, N., Luccioni, A. S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: an Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 208–220). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.trustnlp-1.18