How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature

  • Sun S
  • Shapira O
  • Dagan I
  • et al.

Abstract

We show that plain ROUGE F1 scores are not ideal for comparing current neural systems, which on average produce summaries of different lengths. This is due to a non-linear relationship between ROUGE F1 and summary length. To alleviate the effect of length during evaluation, we propose a new method that normalizes the ROUGE F1 score of a system by that of a random system with the same average output length. A pilot human evaluation shows that humans prefer short summaries in terms of verbosity but overall consider longer summaries to be of higher quality. While human evaluations are more expensive in time and resources, it is clear that normalization, such as the one we propose for automatic evaluation, will make human evaluations more meaningful.
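
The normalization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every specific in it is an assumption: a plain unigram-overlap F1 stands in for ROUGE F1, the "random system" is taken to sample tokens uniformly from the source document, and "normalizes by" is read as a simple ratio. The function names `unigram_f1`, `random_baseline_f1`, and `normalized_f1` are hypothetical.

```python
import random
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists.

    Assumption: a simplified stand-in for ROUGE-1 F1, not the official metric.
    """
    if not candidate or not reference:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def random_baseline_f1(documents, references, length, trials=20, seed=0):
    """Average F1 of a hypothetical random system.

    Assumption: the random system samples `length` tokens (without
    replacement) from each source document; the paper may define it differently.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        for doc, ref in zip(documents, references):
            k = min(length, len(doc))
            scores.append(unigram_f1(rng.sample(doc, k), ref))
    return sum(scores) / len(scores)


def normalized_f1(summaries, documents, references):
    """System F1 divided by the random baseline's F1 at the same average length.

    Assumption: normalization is a ratio; returns NaN if the baseline scores 0.
    """
    system = sum(unigram_f1(s, r) for s, r in zip(summaries, references)) / len(summaries)
    avg_len = round(sum(len(s) for s in summaries) / len(summaries))
    baseline = random_baseline_f1(documents, references, avg_len)
    return system / baseline if baseline > 0 else float("nan")
```

Under this reading, a normalized score above 1 means the system beats a random system that produces output of the same average length, so two systems with different output lengths are each compared against their own length-matched baseline rather than directly on raw F1.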

Cite

Sun, S., Shapira, O., Dagan, I., & Nenkova, A. (2019). How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature (pp. 21–29). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-2303
