Probing Image-Language Transformers for Verb Understanding


Abstract

Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations - in particular, whether these models can distinguish different types of verbs or whether they rely solely on the nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) covering 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more often in situations that require verb understanding than for other parts of speech. We also investigate which categories of verbs are particularly challenging.
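
The probe described above pairs an image with a correct caption and a minimally edited caption in which only the verb is changed; a model with verb understanding should prefer the correct caption. The snippet below is only an illustrative sketch of that setup, not the paper's code: it uses CLIP from the Hugging Face transformers library as a convenient stand-in for the image-language transformers actually probed, and the image path and captions are hypothetical.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in pretrained image-text model (the paper probes other multimodal transformers).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
captions = [
    "A dog is running across the field.",   # correct caption
    "A dog is sleeping across the field.",  # verb foil: only the verb differs
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
scores = outputs.logits_per_image[0]  # image-text similarity for each caption

# A model with genuine verb understanding should score the correct caption higher.
print({c: s.item() for c, s in zip(captions, scores)})

Aggregating how often the correct caption outscores the verb foil, and comparing against foils that swap nouns or other parts of speech instead, gives the kind of verb-versus-noun comparison the abstract describes.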

Cite

APA

Hendricks, L. A., & Nematzadeh, A. (2021). Probing Image-Language Transformers for Verb Understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 3635–3644). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.318
