Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular of those used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never questioned in NLP, let alone formally tested, and whether such evaluations are repeatable and their results reproducible remains an open question. This paper reports our review of human evaluation experiments published in NLP papers over the past five years, which we assessed in terms of (i) whether they can be rerun, and (ii) whether their results are reproduced when they are rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about the experimental design is publicly available for rerunning them. Our estimate rises to about 20% when author help is sought. We complement this investigation with a survey of results on the reproducibility of those human evaluations that are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of the similarity of scores and of the findings they support. We conclude by summarising the insights gleaned so far about how to make human evaluations in NLP more repeatable and more reproducible.
Belz, A., Thomson, C., Reiter, E., & Mille, S. (2023). Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 3676–3687). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.226