Abstract
High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data, which is available in relative abundance, tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than that of existing automatic methods: it obtains performance comparable to expert annotations when used in MERT tuning of a microblog MT system, and training a parallel sentence classifier with it also leads to improved results. The crowdsourced corpora will be made available at http://www.cs.cmu.edu/~lingwang/microtopia/.
Cite
Ling, W., Marujo, L., Dyer, C., Black, A., & Trancoso, I. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 426–436). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-3356