Abstract
In this paper, we revisit automatic metrics for paraphrase evaluation and obtain two findings that contradict conventional wisdom: (1) reference-free metrics achieve better performance than their reference-based counterparts, and (2) most commonly used metrics do not align well with human annotation. We explore the underlying reasons for these findings through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of both reference-based and reference-free metrics and explicitly models lexical divergence; with these improvements, our proposed reference-based variant outperforms reference-free metrics. Experimental results demonstrate that ParaScore significantly outperforms existing metrics. Our code and toolkit are released at https://github.com/shadowkiller33/ParaScore.
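The paper itself gives the full formulation; the sketch below only illustrates the general idea the abstract describes: combine a semantic-similarity score (here BERTScore, one common choice, computed against both the source sentence and the reference, keeping the larger of the two) with an explicit lexical-divergence bonus based on normalized word-level edit distance. The helper names, the simple additive combination, and the weight `gamma` are assumptions for illustration, not the authors' exact formula.

```python
# Illustrative ParaScore-style metric (a sketch, not the authors' exact method).
from bert_score import score as bert_score  # pip install bert-score


def edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance via dynamic programming (rolling row)."""
    s, t = a.split(), b.split()
    dp = list(range(len(t) + 1))
    for i, ws in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, wt in enumerate(t, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ws != wt),  # substitution (free if words match)
            )
    return dp[-1]


def lexical_divergence(src: str, cand: str) -> float:
    """Normalized edit distance in [0, 1]: higher means more surface change."""
    denom = max(len(src.split()), len(cand.split()), 1)
    return edit_distance(src, cand) / denom


def parascore_like(src: str, ref: str, cand: str, gamma: float = 0.1) -> float:
    """Semantic similarity plus a lexical-divergence bonus (illustrative)."""
    # Reference-free view (candidate vs. source) and reference-based view
    # (candidate vs. reference); keep whichever similarity is larger.
    _, _, f_src = bert_score([cand], [src], lang="en")
    _, _, f_ref = bert_score([cand], [ref], lang="en")
    sim = max(f_src.item(), f_ref.item())
    # Reward candidates that preserve meaning while diverging lexically
    # from the source; gamma is an assumed weight for this sketch.
    return sim + gamma * lexical_divergence(src, cand)
```

For example, `parascore_like("the cat sat on the mat", "a cat was sitting on a mat", "the feline rested on the rug")` would score higher than a verbatim copy of the source, since the copy gets no lexical-divergence bonus.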
Citation
Shen, L., Liu, L., Jiang, H., & Shi, S. (2022). On the Evaluation Metrics for Paraphrase Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 3178–3190). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.208