While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language, even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, codeswitching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards preexisting language classifiers. Second, we find that a demographic language model, which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter, can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors.
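One way such a combination could work is sketched below. This is an illustration only, not the authors' ensemble: the abstract does not specify the combination rule. The function `is_english`, both thresholds, the two scorer stubs, and the toy messages and scores are all hypothetical. The sketch accepts a message as English if the supervised identifier is confident, and otherwise falls back to the demographic model. Because the fallback can only add accepted messages, it can raise recall; the strict fallback threshold is intended to limit any loss of precision.

```python
from typing import Callable

# Hypothetical scorer type: maps a message to a score in [0, 1].
Scorer = Callable[[str], float]


def is_english(
    message: str,
    langid_score: Scorer,
    demographic_score: Scorer,
    langid_threshold: float = 0.5,        # assumed value, not from the paper
    demographic_threshold: float = 0.9,   # assumed value, not from the paper
) -> bool:
    """Accept if the supervised identifier is confident; else try the fallback."""
    if langid_score(message) >= langid_threshold:
        return True
    # Fallback: the demographic model may recover messages the supervised
    # identifier missed. The stricter threshold is meant to protect precision.
    return demographic_score(message) >= demographic_threshold


# Toy stand-ins for the two models. These scores are invented for illustration.
TOY_LANGID = {
    "the meeting is at noon": 0.98,
    "he be workin tho": 0.30,
    "nos vemos mañana": 0.02,
}
TOY_DEMOGRAPHIC = {
    "the meeting is at noon": 0.95,
    "he be workin tho": 0.97,
    "nos vemos mañana": 0.05,
}


def langid_stub(message: str) -> float:
    return TOY_LANGID[message]


def demographic_stub(message: str) -> float:
    return TOY_DEMOGRAPHIC[message]


def never_english(message: str) -> float:
    # Disables the fallback, leaving the supervised identifier on its own.
    return 0.0


if __name__ == "__main__":
    for text in TOY_LANGID:
        alone = is_english(text, langid_stub, never_english)
        combined = is_english(text, langid_stub, demographic_stub)
        print(f"{text!r}: supervised only={alone}, combined={combined}")
```

With these toy scores, the second message is rejected by the supervised stub alone but accepted by the combination, while the non-English message is rejected either way.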
Blodgett, S. L., Wei, J. T. Z., & O’Connor, B. (2017). A Dataset and Classifier for Recognizing Social Media English. In 3rd Workshop on Noisy User-Generated Text, W-NUT 2017 - Proceedings of the Workshop (pp. 56–61). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4408