Comparing automatic evaluation measures for image description


Abstract

Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; nevertheless, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine how well automatic measures correlate with human judgements for this task. We estimate the correlation of unigram BLEU, Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. The main finding is that unigram BLEU correlates only weakly with human judgements, while Meteor correlates most strongly.
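
The correlation analysis the abstract describes can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' code: it scores candidate descriptions with NLTK's smoothed sentence-level BLEU and correlates those scores with human judgements using Spearman's rho. The toy data, the human scores, and the choice of smoothing method are assumptions made for the example.

```python
# Minimal sketch of metric-vs-human correlation for image description.
# Data and smoothing choice are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Toy data: one reference description and one candidate per image.
references = [
    ["a", "man", "rides", "a", "bike", "down", "the", "street"],
    ["two", "dogs", "play", "in", "the", "park"],
    ["a", "girl", "eats", "an", "apple"],
]
candidates = [
    ["a", "man", "riding", "a", "bicycle", "down", "the", "street"],
    ["dogs", "playing", "outside"],
    ["a", "girl", "eating", "fruit"],
]
human_scores = [4.5, 2.0, 3.0]  # hypothetical averaged ratings

# Smoothed BLEU at the sentence level (NIST-style smoothing).
smooth = SmoothingFunction().method3
metric_scores = [
    sentence_bleu([ref], cand, smoothing_function=smooth)
    for ref, cand in zip(references, candidates)
]

# Rank correlation between the automatic metric and human judgements.
rho, p = spearmanr(metric_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p:.3g})")
```

In practice such an analysis would be run over hundreds of image-description pairs, and the same human scores would be correlated against each metric (unigram BLEU, Smoothed BLEU, TER, ROUGE-SU4, Meteor) in turn to compare them.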

Citation (APA)

Elliott, D., & Keller, F. (2014). Comparing automatic evaluation measures for image description. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 2, pp. 452–457). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/p14-2074
