Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated, in terms of specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, and hence for comparing evaluations across papers, for meta-evaluation experiments, and for reproducibility testing.
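To make the proposed distinction concrete, the sketch below records an evaluation as a (quality criterion, evaluation mode, experimental design) triple and treats two evaluations as comparable when criterion and mode match. This is a simplified, illustrative reading of the abstract, not the paper's actual classification scheme; all field names and example values (e.g. "fluency", "intrinsic") are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HumanEvaluation:
    """Illustrative record separating *what* is evaluated from *how* it is evaluated.

    Field names and values are placeholders, not the paper's own labels.
    """
    quality_criterion: str          # what is evaluated, e.g. "fluency"
    evaluation_mode: frozenset      # how (a): mode properties, e.g. {"intrinsic", "absolute"}
    design: dict = field(default_factory=dict, compare=False)  # how (b): experimental-design details


def comparable(a: HumanEvaluation, b: HumanEvaluation) -> bool:
    """Treat two evaluations as comparable only if they assess the same quality
    criterion in the same evaluation mode; experimental-design differences are
    recorded but handled separately (a deliberate simplification)."""
    return (a.quality_criterion == b.quality_criterion
            and a.evaluation_mode == b.evaluation_mode)


# Example: two fluency evaluations that differ only in experimental design.
e1 = HumanEvaluation("fluency", frozenset({"intrinsic", "absolute"}), {"scale": "1-5 Likert"})
e2 = HumanEvaluation("fluency", frozenset({"intrinsic", "absolute"}), {"scale": "0-100 slider"})
print(comparable(e1, e2))  # True: same criterion and mode; designs differ
```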
Citation
Belz, A., Mille, S., & Howcroft, D. M. (2020). Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing. In Proceedings of the 13th International Conference on Natural Language Generation (INLG 2020), pp. 183–194. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.inlg-1.24