Crowdsourcing high-quality parallel data extraction from Twitter

Abstract

High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data, although relatively abundant, tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than that of existing automatic methods: it yields performance comparable to expert annotations when used in MERT tuning of a microblog MT system, and training a parallel sentence classifier with it also leads to improved results. The crowdsourced corpora will be made available at http://www.cs.cmu.edu/~lingwang/microtopia/.

Cite

Ling, W., Marujo, L., Dyer, C., Black, A., & Trancoso, I. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 426–436). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-3356
