Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough?

  • Chin-Yew Lin

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper discusses the validity of the evaluation method used in the Document Understanding Conference (DUC) and evaluates five different ROUGE metrics: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU included in the ROUGE summarization evaluation package using data provided by DUC. A comprehensive study of the effects of using single or multiple references and various sample sizes on the stability of the results is also presented.
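
The n-gram overlap counting described in the abstract can be illustrated with a minimal Python sketch of a recall-style ROUGE-N score. This is not the ROUGE package itself: the function names `ngrams` and `rouge_n_recall`, the whitespace tokenization, the lowercasing, and the way multiple references are pooled are all assumptions made for illustration, and the actual package may differ in these details.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Recall-style n-gram overlap between a candidate and reference summaries.

    Assumptions (not taken from the paper): whitespace tokenization,
    lowercasing, and pooling of counts over all references.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    # With no reference n-grams there is nothing to recall.
    return matched / total if total else 0.0


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    references = ["the cat was on the mat"]
    print(rouge_n_recall(candidate, references, n=1))
    print(rouge_n_recall(candidate, references, n=2))
```

Dividing by the number of n-grams in the references rather than in the candidate is what makes the score recall-oriented: it measures how much of the human-written content the system summary covers.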

Author-supplied keywords

  • automatic evaluation
  • Document Understanding Conference
  • DUC
  • longest common subsequence
  • ROUGE
  • similarity
  • summarization
  • unigram or bigram
  • unit overlap
