Abstract
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress in representation learning. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are paired only with other captions of the same image, there are no negative associations, and positive cross-modal associations are missing. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs; crucially, this demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
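The dual encoder setup the abstract refers to maps images and captions into one shared embedding space, so that both inter-modal (image-caption) and intra-modal (caption-caption) similarity reduce to the same score function. Below is a minimal, self-contained sketch of that idea using random linear projections as stand-ins for trained encoder towers; the dimensions and helper names are illustrative assumptions, not details from the paper.

```python
import math
import random

random.seed(0)

# Hypothetical dimensions (illustrative only, not from the paper).
IMG_DIM, TXT_DIM, JOINT_DIM = 8, 6, 4

def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for a trained projection."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Each "tower" of the dual encoder projects its modality into the shared space.
W_img = rand_matrix(IMG_DIM, JOINT_DIM)
W_txt = rand_matrix(TXT_DIM, JOINT_DIM)

def encode(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def similarity(a, b):
    """Cosine similarity of two normalized embeddings (a dot product)."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Toy feature vectors for one image and two captions.
img = [random.gauss(0, 1) for _ in range(IMG_DIM)]
cap_a = [random.gauss(0, 1) for _ in range(TXT_DIM)]
cap_b = [random.gauss(0, 1) for _ in range(TXT_DIM)]

z_img = encode(img, W_img)
z_a = encode(cap_a, W_txt)
z_b = encode(cap_b, W_txt)

# Inter-modal (image-caption) and intra-modal (caption-caption) scores come
# from the same shared space -- exactly the associations CxC annotates.
inter_score = similarity(z_img, z_a)
intra_score = similarity(z_a, z_b)
```

Because both towers normalize into the same space, a single cosine score covers every pair type CxC rates, which is what lets one model be evaluated on intra- and inter-modal judgments at once.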
Citation
Parekh, Z., Baldridge, J., Cer, D., Waters, A., & Yang, Y. (2021). Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2855–2870). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.eacl-main.249