Abstract
We introduce Discriminative BLEU (ΔBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1], and these ratings are used to weight multi-reference BLEU. In tasks involving generation of conversational responses, ΔBLEU correlates reasonably with human judgments and outperforms both sentence-level BLEU and IBM BLEU in terms of Spearman's ρ and Kendall's τ.
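The core idea of weighting multi-reference n-gram matches by human quality ratings can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: `delta_bleu_precision` is a hypothetical helper showing a single rating-weighted n-gram precision, omitting BLEU's geometric mean over n-gram orders and its brevity penalty.

```python
from collections import Counter

def delta_bleu_precision(hyp_tokens, refs, n=1):
    """Rating-weighted n-gram precision in the spirit of ΔBLEU (sketch).

    `refs` is a list of (reference_tokens, rating) pairs, with ratings
    in [-1, +1] as assigned by human raters. Matched n-grams are
    credited with the rating of the best-rated reference containing
    them; unmatched n-grams contribute nothing to the numerator.
    """
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    hyp_counts = ngrams(hyp_tokens)
    ref_counts = [(ngrams(r), q) for r, q in refs]
    num = den = 0.0
    for g, c in hyp_counts.items():
        # Clipped matches against every reference that contains this n-gram.
        matches = [(min(c, rc[g]), q) for rc, q in ref_counts if g in rc]
        if matches:
            clipped, q = max(matches, key=lambda t: t[1])  # best-rated reference
            num += q * clipped
        # Denominator: each hypothesis n-gram weighted by the best rating
        # available in the reference set (simplification).
        den += c * max(q for _, q in ref_counts)
    return num / den if den else 0.0
```

Note that a match against a negatively rated reference subtracts from the score, which is how the metric discriminates between plausible and poor references rather than rewarding all matches equally.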
Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., … Dolan, B. (2015). ΔBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference (Vol. 2, pp. 445–450). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/P15-2073