Dialect-robust Evaluation of Generated Text

2 citations · 13 Mendeley readers

Abstract

Text generation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. In this paper, we introduce a suite of methods to assess whether metrics are dialect robust. These methods show that state-of-the-art metrics are not dialect robust: they often prioritize dialect similarity over semantics, preferring outputs that are semantically incorrect over outputs that match the semantics of the reference but contain dialect differences. As a step towards dialect-robust metrics for text generation, we propose NANO, which introduces regional and language information to the metric's pretraining. NANO significantly improves dialect robustness while preserving the correlation between automated metrics and human ratings. It also enables a more ambitious approach to evaluation, dialect awareness, in which system outputs are scored by both semantic match to the reference and appropriateness in any specified dialect.

Citation (APA)

Sun, J., Sellam, T., Clark, E., Vu, T., Dozat, T., Garrette, D., … Gehrmann, S. (2023). Dialect-robust Evaluation of Generated Text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 6010–6028). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.331
