A Dataset and Classifier for Recognizing Social Media English

Abstract

While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language, even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, codeswitching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards preexisting language classifiers. Second, we find that a demographic language model, which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter, can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors.

Cite

Blodgett, S. L., Wei, J. T. Z., & O’Connor, B. (2017). A Dataset and Classifier for Recognizing Social Media English. In 3rd Workshop on Noisy User-Generated Text, W-NUT 2017 - Proceedings of the Workshop (pp. 56–61). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4408
