Abstract
We present a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora with the idea of distinguishing between rare and frequent events. We validate our approach using data selection for machine translation, and show that it maintains or improves BLEU and TER translation scores while substantially improving vocabulary coverage and reducing data selection model size. Paradoxically, the coverage improvement is achieved by abstracting away over 97% of the total training corpus vocabulary using simple part-of-speech tags during the data selection process.
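The hybrid representation can be illustrated with a minimal sketch. It assumes the corpus is already part-of-speech tagged and that "rare" means a word occurring fewer than `min_count` times in that corpus. The function name `hybridize`, the threshold, and the toy data are all hypothetical, and the paper's actual criterion for separating rare from frequent words may differ.

```python
from collections import Counter


def hybridize(tagged_sentences, min_count=2):
    """Keep frequent words as-is; replace rarer words with their POS tag.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    min_count: illustrative frequency threshold (an assumption, not from the paper).
    """
    # Count case-folded word frequencies over the whole corpus.
    counts = Counter(
        word.lower() for sent in tagged_sentences for word, _ in sent
    )
    # Build the hybrid word/POS version of each sentence.
    return [
        [word if counts[word.lower()] >= min_count else tag for word, tag in sent]
        for sent in tagged_sentences
    ]


# Toy pre-tagged corpus (Penn Treebank style tags, chosen for illustration).
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
hybrid = hybridize(corpus, min_count=2)
```

In this toy corpus, "cat" and "dog" each occur once and are replaced by their tag, so both sentences collapse to the same hybrid sequence. This is the sense in which abstracting rare words shrinks the vocabulary that a data selection model has to cover.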
Cite
Axelrod, A., He, X., Resnik, P., & Ostendorf, M. (2015). Data selection with fewer words. In 10th Workshop on Statistical Machine Translation, WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Proceedings (pp. 58–65). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w15-3003