Abstract
In this paper, we revisit automatic metrics for paraphrase evaluation and obtain two findings that contradict conventional wisdom: (1) reference-free metrics achieve better performance than their reference-based counterparts, and (2) most commonly used metrics do not align well with human annotation. We explore the underlying reasons for these findings through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of both reference-based and reference-free metrics and explicitly models lexical divergence; with these improvements, our proposed reference-based variant outperforms reference-free metrics. Experimental results demonstrate that ParaScore significantly outperforms existing metrics. Our code and toolkit are released at https://github.com/shadowkiller33/ParaScore.
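The paper itself gives the full formulation; the sketch below only illustrates the general idea the abstract describes: combine a semantic-similarity score (here BERTScore, one common choice, computed against both the source sentence and the reference, keeping the larger of the two) with an explicit lexical-divergence bonus based on normalized word-level edit distance. The helper names, the simple additive combination, and the weight `gamma` are assumptions for illustration, not the authors' exact formula.

```python
# Illustrative ParaScore-style metric (a sketch, not the authors' exact method).
from bert_score import score as bert_score  # pip install bert-score


def edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance via dynamic programming (rolling row)."""
    s, t = a.split(), b.split()
    dp = list(range(len(t) + 1))
    for i, ws in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, wt in enumerate(t, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ws != wt),  # substitution (free if words match)
            )
    return dp[-1]


def lexical_divergence(src: str, cand: str) -> float:
    """Normalized edit distance in [0, 1]: higher means more surface change."""
    denom = max(len(src.split()), len(cand.split()), 1)
    return edit_distance(src, cand) / denom


def parascore_like(src: str, ref: str, cand: str, gamma: float = 0.1) -> float:
    """Semantic similarity plus a lexical-divergence bonus (illustrative)."""
    # Reference-free view (candidate vs. source) and reference-based view
    # (candidate vs. reference); keep whichever similarity is larger.
    _, _, f_src = bert_score([cand], [src], lang="en")
    _, _, f_ref = bert_score([cand], [ref], lang="en")
    sim = max(f_src.item(), f_ref.item())
    # Reward candidates that preserve meaning while diverging lexically
    # from the source; gamma is an assumed weight for this sketch.
    return sim + gamma * lexical_divergence(src, cand)
```

For example, `parascore_like("the cat sat on the mat", "a cat was sitting on a mat", "the feline rested on the rug")` would score higher than a verbatim copy of the source, since the copy gets no lexical-divergence bonus.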
Citation
Shen, L., Liu, L., Jiang, H., & Shi, S. (2022). On the Evaluation Metrics for Paraphrase Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 3178–3190). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.208