Evaluation results recently reported by Callison-Burch et al. (2006) and Koehn and Monz (2006) revealed that, in certain cases, the BLEU metric may not be a reliable indicator of MT quality. This happens, for instance, when the systems under evaluation are based on different paradigms and therefore do not share the same lexicon. The reason is that, while MT quality is multifaceted, BLEU limits its scope to the lexical dimension. In this work, we suggest using metrics that take into account linguistic features at more abstract levels. We provide experimental results showing that metrics based on deeper linguistic information (syntactic/shallow-semantic) produce more reliable system rankings than metrics based on lexical matching alone, especially when the systems under evaluation are of a different nature.
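To make the lexical/syntactic contrast concrete, the following is a minimal Python sketch, not the metrics actually used in the paper: the same clipped n-gram overlap at the heart of BLEU is applied once over surface words and once over hand-written (head-POS, relation, dependent-POS) dependency triples standing in for parser output. The example sentences and triples are illustrative assumptions; two paraphrases can share almost no words yet match fully at the syntactic level.

from collections import Counter

def clipped_ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision over arbitrary item sequences.

    This is the core overlap computation behind BLEU; reusing it over
    syntactic units instead of words shifts the metric to a more
    abstract linguistic level.
    """
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

# Lexical level: the paraphrase shares only one surface word ("were").
cand_words = "the authorities were notified at once".split()
ref_words = "officials were informed immediately".split()

# Syntactic level: hypothetical dependency triples standing in for
# parser output; both sentences realize the same structure.
cand_deps = [("VERB", "nsubj", "NOUN"), ("VERB", "aux", "AUX"),
             ("VERB", "advmod", "ADV")]
ref_deps = [("VERB", "nsubj", "NOUN"), ("VERB", "aux", "AUX"),
            ("VERB", "advmod", "ADV")]

print(clipped_ngram_precision(cand_words, ref_words, n=1))  # ~0.17: low lexical overlap
print(clipped_ngram_precision(cand_deps, ref_deps, n=1))    # 1.0: full syntactic overlap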
Citation
Giménez, J., & Màrquez, L. (2007). Linguistic features for automatic evaluation of heterogenous MT systems. In Proceedings of the Second Workshop on Statistical Machine Translation (pp. 256–264). Association for Computational Linguistics. https://doi.org/10.3115/1626355.1626393