Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO

Abstract

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations, and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs, which crucially demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
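
The abstract does not spell out the evaluation protocol, but a CxC-style check of a dual encoder typically correlates model similarity scores with the human semantic similarity judgments. The sketch below is a minimal illustration of that idea, assuming precomputed embeddings and placeholder ratings; the random arrays stand in for a real model's image/caption towers and for the CxC annotations, and the paper's actual data format and metrics may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_sim(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

# Placeholder embeddings standing in for a dual encoder's outputs
# (image tower and caption tower); shapes are (num_pairs, dim).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(267, 128))
caption_emb = rng.normal(size=(267, 128))

# Hypothetical human similarity judgments for the same pairs,
# on a 0-5 scale as is common in semantic similarity annotation.
human_scores = rng.uniform(0, 5, size=267)

# Model similarity for each annotated pair, then rank correlation
# against the human judgments.
model_scores = cosine_sim(image_emb, caption_emb)
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

Spearman correlation is a common choice for such similarity benchmarks because it assumes only a monotonic relationship between model scores and human ratings, not a shared scale.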

Citation (APA)

Parekh, Z., Baldridge, J., Cer, D., Waters, A., & Yang, Y. (2021). Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2855–2870). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.eacl-main.249
