"I'm" Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets

Katie Seaborn; Yeongdae Kim

Conference Proceedings (Open Access)

Conference on Human Factors in Computing Systems - Proceedings (2023)

DOI: 10.1145/3544549.3585667

Abstract

As virtual assistants continue to be taken up globally, there is an ever-greater need for these speech-based systems to communicate naturally in a variety of languages. Crowdsourcing initiatives have focused on multilingual translation of big, open data sets for use in natural language processing (NLP). Yet, language translation is often not one-to-one, and biases can trickle in. In this late-breaking work, we focus on the case of pronouns translated between English and Japanese in the crowdsourced Tatoeba database. We found that masculine pronoun biases were present overall, even though plurality in language was accounted for in other ways. Importantly, we detected biases in the translation process that reflect nuanced reactions to the presence of feminine, neutral, and/or non-binary pronouns. We raise the issue of translation bias for pronouns and offer a practical solution to embed plurality in NLP data sets.

Cite

Seaborn, K., & Kim, Y. (2023). “I’m” Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets. In Conference on Human Factors in Computing Systems - Proceedings. Association for Computing Machinery. https://doi.org/10.1145/3544549.3585667