Sentence mover's similarity: Automatic evaluation for multi-sentence texts

Abstract

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover's similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5 sentences). We also show that sentence mover's similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.

Cite

Clark, E., Celikyilmaz, A., & Smith, N. A. (2020). Sentence mover’s similarity: Automatic evaluation for multi-sentence texts. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2748–2760). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-1264
