Abstract
Current pre-trained models applied to summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.
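For readers unfamiliar with Best-Worst Scaling (BWS), the sketch below illustrates the conventional counting-based BWS score, in which each item's score is (#times chosen best − #times chosen worst) divided by the number of tuples it appears in. This is not the paper's "value learning" algorithm, whose details are not given in the abstract; the function name `bws_count_scores` and the annotation format are illustrative assumptions.

```python
from collections import defaultdict

def bws_count_scores(annotations):
    """Standard counting-based Best-Worst Scaling (not the paper's value learning).

    Each annotation is a dict with:
      - "items": the summaries shown together in one BWS tuple
      - "best":  the item the annotator picked as best
      - "worst": the item the annotator picked as worst
    Returns a score in [-1, 1] per item:
      (#times best - #times worst) / #times shown.
    """
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for ann in annotations:
        for item in ann["items"]:
            shown[item] += 1
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Example: three system summaries (A, B, C) judged in two 3-item tuples.
annotations = [
    {"items": ["A", "B", "C"], "best": "A", "worst": "C"},
    {"items": ["A", "B", "C"], "best": "B", "worst": "C"},
]
print(bws_count_scores(annotations))  # {'A': 0.5, 'B': 0.5, 'C': -1.0}
```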
Citation
Tang, X., Fabbri, A. R., Mao, Z., Adams, G., Wang, B., Li, H., … Radev, D. (2022). Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries. In NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 5680–5692). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.naacl-main.417