Normalizing tweets with edit scripts and recurrent neural embeddings


Abstract

Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools, and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources. © 2014 Association for Computational Linguistics.
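To make the notion of character-level edit operations concrete, here is a minimal sketch that derives an edit script (keep, substitute, insert, delete) turning a noisy tweet token into its canonical form via standard Levenshtein dynamic programming. This is purely illustrative: the paper's model *learns to predict* such per-character operations from labeled data (with embedding features), whereas this snippet only computes a script from a known source/target pair, and the operation labels used here are hypothetical.

```python
def edit_script(source, target):
    """Compute a character-level edit script turning `source` into `target`.

    Returns a list of (operation, source_char) pairs, where operation is
    "KEEP", "DEL", "SUB:<c>", or "INS:<c>". Illustrative labels only.
    """
    m, n = len(source), len(target)
    # dp[i][j] = minimum number of edits to turn source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete source char
                           dp[i][j - 1] + 1,          # insert target char
                           dp[i - 1][j - 1] + cost)   # keep or substitute

    # Backtrace through the table to recover one optimal edit script.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1]
                + (0 if source[i - 1] == target[j - 1] else 1)):
            if source[i - 1] == target[j - 1]:
                ops.append(("KEEP", source[i - 1]))
            else:
                ops.append(("SUB:" + target[j - 1], source[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("DEL", source[i - 1]))
            i -= 1
        else:
            ops.append(("INS:" + target[j - 1], ""))
            j -= 1
    return ops[::-1]

# Example: a common tweet abbreviation and its canonical form.
print(edit_script("2moro", "tomorrow"))
```

Applying the returned operations left to right (emitting the kept, substituted, and inserted characters) reconstructs the canonical form, which is what a trained per-character tagger over such operations would do at prediction time.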

APA

Chrupała, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 2, pp. 680–686). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/p14-2111
