Abstract
A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentence-level comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection. We continue this line of work by adapting the TrueSkill™ algorithm, an online approach for modeling the relative skills of players in ongoing competitions such as Microsoft's Xbox Live, to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
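The core idea, treating each MT system as a "player" whose latent skill is a Gaussian updated after every pairwise human judgment, can be sketched as follows. This is a minimal, draw-free TrueSkill-style update written for illustration; the constants (`MU0`, `SIGMA0`, `BETA`) follow the common defaults from the original TrueSkill formulation, not necessarily the settings used in this paper, and `trueskill_update` is a hypothetical helper name.

```python
import math

# Common TrueSkill-style defaults (assumed, not taken from the paper).
MU0 = 25.0          # prior mean skill
SIGMA0 = MU0 / 3.0  # prior skill uncertainty
BETA = MU0 / 6.0    # per-comparison performance noise

def _pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trueskill_update(winner, loser, beta=BETA):
    """One online update from a single pairwise judgment.

    winner, loser: (mu, sigma) tuples; returns updated tuples.
    """
    mu_w, sig_w = winner
    mu_l, sig_l = loser
    # Total uncertainty over the performance difference.
    c = math.sqrt(2.0 * beta**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    # Mean shift (v) and variance-shrink (w) factors for a win.
    v = _pdf(t) / _cdf(t)
    w = v * (v + t)
    new_winner = (mu_w + (sig_w**2 / c) * v,
                  sig_w * math.sqrt(max(1.0 - (sig_w**2 / c**2) * w, 1e-9)))
    new_loser = (mu_l - (sig_l**2 / c) * v,
                 sig_l * math.sqrt(max(1.0 - (sig_l**2 / c**2) * w, 1e-9)))
    return new_winner, new_loser

# Example: two systems start at the prior; system A wins one comparison.
sys_a, sys_b = (MU0, SIGMA0), (MU0, SIGMA0)
sys_a, sys_b = trueskill_update(sys_a, sys_b)
```

After one judgment, the winner's mean rises, the loser's falls, and both uncertainties shrink; the paper's annotation-saving idea is to preferentially sample comparisons between systems whose skill distributions still overlap heavily.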
Sakaguchi, K., Post, M., & Van Durme, B. (2014). Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1–11). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-3301