FOIL it! Find One mismatch between image and language caption

Ravi Shekhar; Sandro Pezzelle; Yauhen Klimovich; Aurélie Herbelot; Moin Nabi; Enver Sangineto; Raffaella Bernardi

Conference Proceedings (Open Access)

ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (2017), Vol. 1, pp. 255-265

DOI: 10.18653/v1/P17-1024

Abstract

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and 'foil' captions, that is, descriptions of the image that are highly similar to the original ones but contain a single mistake (the 'foil word'). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, achieve near-perfect performance on these tasks. We demonstrate that language cues alone are not enough to model FOIL-COCO, and that the dataset challenges the state of the art by requiring a fine-grained understanding of the relation between text and image.
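To make the three tasks concrete, the following is a minimal sketch of how a foil caption and its gold labels might be represented. It is an illustration only: the class name, field names, helper functions, and the example caption are all hypothetical, and the abstract does not describe how the authors actually select replacement words.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaptionExample:
    """One image-caption pair. All field names here are hypothetical."""
    image_id: int
    tokens: List[str]
    foil_index: Optional[int] = None  # position of the foil word; None if the caption is correct

    @property
    def is_foil(self) -> bool:
        return self.foil_index is not None


def make_foil(example: CaptionExample, target: str, replacement: str) -> CaptionExample:
    """Copy a correct caption, swapping the first occurrence of `target` for `replacement`."""
    idx = example.tokens.index(target)  # raises ValueError if `target` is not in the caption
    tokens = list(example.tokens)
    tokens[idx] = replacement
    return CaptionExample(example.image_id, tokens, foil_index=idx)


def classification_label(example: CaptionExample) -> str:
    """Task (a): is the caption correct or a foil?"""
    return "foil" if example.is_foil else "correct"


def detection_label(example: CaptionExample) -> Optional[int]:
    """Task (b): which token position holds the foil word?"""
    return example.foil_index


def correction_label(foil: CaptionExample, original: CaptionExample) -> str:
    """Task (c): which word should replace the foil word?"""
    return original.tokens[foil.foil_index]


# Invented example caption, not taken from the dataset.
original = CaptionExample(1, "a man rides a bicycle down the street".split())
foil = make_foil(original, "bicycle", "motorcycle")
print(" ".join(foil.tokens))
```

In this sketch a model would receive the image together with `foil.tokens` and be scored against the three labels; the single swapped token is what makes the foil caption "highly similar" to the original while still mismatching the image.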

Cite

Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., & Bernardi, R. (2017). FOIL it! Find One mismatch between image and language caption. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 1, pp. 255–265). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-1024
