Dialect-robust Evaluation of Generated Text

2 citations · 13 Mendeley readers

Abstract

Text generation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. In this paper, we introduce a suite of methods to assess whether metrics are dialect robust. These methods show that state-of-the-art metrics are not dialect robust: they often prioritize dialect similarity over semantics, preferring outputs that are semantically incorrect over outputs that match the semantics of the reference but contain dialect differences. As a step towards dialect-robust metrics for text generation, we propose NANO, which introduces regional and language information to the metric's pretraining. NANO significantly improves dialect robustness while preserving the correlation between automated metrics and human ratings. It also enables a more ambitious approach to evaluation, dialect awareness, in which system outputs are scored by both semantic match to the reference and appropriateness in any specified dialect.

Citation (APA)

Sun, J., Sellam, T., Clark, E., Vu, T., Dozat, T., Garrette, D., … Gehrmann, S. (2023). Dialect-robust Evaluation of Generated Text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 6010–6028). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.331
