MENLI: Robust Evaluation Metrics from Natural Language Inference

Abstract

Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher-quality metrics as measured on standard benchmarks (+5% to 30%).
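
The combination the abstract refers to can be pictured as a weighted blend of an existing metric's score with an NLI entailment score, and the attack framework as a pairwise preference test. The sketch below is illustrative only: the function names, the default weight w, and the [0, 1] normalization are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact formulation): blending an
# existing metric's score with an NLI entailment probability, and the
# preference test underlying a preference-based adversarial attack.

def combine_scores(nli_score: float, base_score: float, w: float = 0.5) -> float:
    """Weighted blend of an NLI score and an existing metric's score.

    nli_score  -- probability that the candidate is entailed by the
                  reference/source, assumed in [0, 1]
    base_score -- score from an existing metric (e.g., BERTScore),
                  assumed normalized to [0, 1]
    w          -- mixing weight; w = 1.0 recovers the pure NLI metric
    """
    return w * nli_score + (1.0 - w) * base_score


def prefers_original(score, ref: str, original: str, adversarial: str) -> bool:
    """Preference test: a robust metric should score the untouched
    candidate above its adversarially perturbed variant."""
    return score(ref, original) > score(ref, adversarial)


# Example: a candidate the base metric rates highly but the NLI model
# finds weakly entailed is pulled down by the combination.
print(combine_scores(nli_score=0.2, base_score=0.9, w=0.5))  # -> 0.55
```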

Citation (APA)

Chen, Y., & Eger, S. (2023). MENLI: Robust Evaluation Metrics from Natural Language Inference. Transactions of the Association for Computational Linguistics, 11, 804–825. https://doi.org/10.1162/tacl_a_00576
