Incorporating dialectal variability for socially equitable language identification

David Jurgens; Yulia Tsvetkov; Dan Jurafsky

Conference Proceedings (Open Access)

ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (2017), Vol. 2, pp. 51–57

DOI: 10.18653/v1/P17-2009

67 Citations

129 Readers

Abstract

Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of “socially inclusive” NLP tools.

Citation (APA)

Jurgens, D., Tsvetkov, Y., & Jurafsky, D. (2017). Incorporating dialectal variability for socially equitable language identification. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 2, pp. 51–57). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2009

Readers' Seniority

PhD / Post grad / Masters / Doc: 49 (77%)

Researcher: 7 (11%)

Professor / Associate Prof.: 4

Lecturer / Post doc: 4

Readers' Discipline

Computer Science: 54 (75%)

Linguistics: 13 (18%)

Engineering: 3

Social Sciences: 2

Article Metrics

Mentions

News Mentions: 2
