On the Evaluation Metrics for Paraphrase Generation

Abstract

In this paper, we revisit automatic metrics for paraphrase evaluation and report two findings that run counter to conventional wisdom: (1) reference-free metrics achieve better performance than their reference-based counterparts, and (2) most commonly used metrics do not align well with human annotation. We explore the underlying reasons behind these findings through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It combines the merits of reference-based and reference-free metrics and explicitly models lexical divergence. With our analysis and improvements, the proposed reference-based metric outperforms reference-free metrics. Experimental results demonstrate that ParaScore significantly outperforms existing metrics. Our code and toolkit are released at https://github.com/shadowkiller33/ParaScore.
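For intuition, the sketch below illustrates how a metric in this spirit might combine an embedding-based semantic similarity with a normalized edit-distance term that rewards surface-level divergence from the source. This is a minimal sketch, not the authors' formula or implementation: the function names, the weight, the word-level edit distance, and the placeholder semantic-similarity component are all illustrative assumptions; refer to the linked repository for the official code.

```python
# Minimal ParaScore-style sketch (NOT the official implementation; see
# https://github.com/shadowkiller33/ParaScore for the authors' code).
# The weight and the choice of word-level edit distance are assumptions.

def edit_distance(a, b):
    """Word-level Levenshtein distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(b)]


def lexical_divergence(source, candidate):
    """Normalized edit distance in [0, 1]; higher means more surface change."""
    src, cand = source.split(), candidate.split()
    if not src and not cand:
        return 0.0
    return edit_distance(src, cand) / max(len(src), len(cand))


def para_score_sketch(source, candidate, reference, semantic_sim, weight=0.5):
    """Combine semantic adequacy with a bonus for lexical divergence.

    `semantic_sim` is any embedding-based similarity in [0, 1]
    (e.g. BERTScore F1); `weight` is a hypothetical hyperparameter,
    not a value taken from the paper.
    """
    adequacy = max(semantic_sim(source, candidate),
                   semantic_sim(reference, candidate))
    divergence = lexical_divergence(source, candidate)
    return adequacy + weight * divergence
```

In practice, the semantic component would come from a pretrained-model-based score such as BERTScore, and the divergence weight would be tuned against human judgments; the sketch only shows how the two signals could be combined.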

Cite

APA

Shen, L., Liu, L., Jiang, H., & Shi, S. (2022). On the Evaluation Metrics for Paraphrase Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 3178–3190). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.208
