Incorporating dialectal variability for socially equitable language identification

Abstract

Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of “socially inclusive” NLP tools.
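
The abstract does not spell out the model's internals, so the following is only a rough illustration of why character-level evidence helps with code-switched text. It is a toy per-token identifier built on character trigram counts, not the paper's sequence-to-sequence model. The class name `CharNgramLID`, the training words, and the `en`/`es` labels are all invented for this sketch.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (token, language) pairs, invented for illustration.
TRAIN = [
    ("the", "en"), ("is", "en"), ("very", "en"), ("good", "en"),
    ("today", "en"), ("weather", "en"), ("what", "en"), ("this", "en"),
    ("el", "es"), ("es", "es"), ("muy", "es"), ("bueno", "es"),
    ("hoy", "es"), ("tiempo", "es"), ("que", "es"), ("esta", "es"),
]


def char_ngrams(token, n=3):
    """Return the character n-grams of a token, padded with boundary markers."""
    padded = f"^{token.lower()}$"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramLID:
    """Toy per-token language identifier using add-one-smoothed n-gram counts.

    This is a simple stand-in, not the encoder-decoder model from the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, labeled_tokens):
        for token, lang in labeled_tokens:
            grams = char_ngrams(token, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)
        return self

    def _log_score(self, token, lang):
        total = sum(self.counts[lang].values())
        size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        return sum(
            math.log((self.counts[lang][g] + 1) / (total + size))
            for g in char_ngrams(token, self.n)
        )

    def predict_tokens(self, text):
        """Label each whitespace-separated token with its best-scoring language."""
        langs = sorted(self.counts)
        return [
            (tok, max(langs, key=lambda lang: self._log_score(tok, lang)))
            for tok in text.split()
        ]


model = CharNgramLID().fit(TRAIN)
print(model.predict_tokens("the weather es muy bueno"))
```

Labeling each token separately, rather than assigning one language to the whole message, is what lets a mixed English/Spanish sentence keep both labels instead of being forced into a single class.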

Cite

Jurgens, D., Tsvetkov, Y., & Jurafsky, D. (2017). Incorporating dialectal variability for socially equitable language identification. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers) (Vol. 2, pp. 51–57). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2009
