ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper discusses the validity of the evaluation method used in the Document Understanding Conference (DUC) and evaluates five different ROUGE metrics included in the ROUGE summarization evaluation package — ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU — using data provided by DUC. A comprehensive study of the effects of using single or multiple references and various sample sizes on the stability of the results is also presented.
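To make the overlap-counting idea concrete, here is a minimal sketch of ROUGE-N recall — the fraction of reference n-grams that also appear in the candidate summary. This is an illustrative simplification, not the official ROUGE package; the function name and tokenization are assumptions for the example.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """Illustrative ROUGE-N recall: fraction of reference n-grams
    found in the candidate. Inputs are lists of tokens."""
    def ngrams(tokens):
        # Count each n-gram (as a tuple) with its multiplicity.
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference)
    cand_counts = ngrams(candidate)
    # Clipped overlap: an n-gram is matched at most as often
    # as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example: 3 of the 5 reference bigrams appear in the candidate.
cand = "the cat sat on the mat".split()
ref = "the cat was on the mat".split()
print(rouge_n_recall(cand, ref, n=2))  # 0.6
```

With multiple reference summaries, ROUGE takes the per-reference scores and aggregates them (e.g., the maximum over references), which is part of what the paper's stability study examines.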
Lin, C.-Y. (2004). Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough. In Proceedings of the NTCIR Workshop (pp. 1765–1776).