Be different to be better! A benchmark to leverage the complementarity of language and vision

Abstract

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against that of human speakers. We show that, while the task is relatively easy for humans, the best-performing models struggle to achieve similar results.

Cite

APA

Pezzelle, S., Greco, C., Gandolfi, G., Gualdoni, E., & Bernardi, R. (2020). Be different to be better! A benchmark to leverage the complementarity of language and vision. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 2751–2767). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.248
