EViLBERT: Learning task-agnostic multimodal sense embeddings

Abstract

The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.

Cite

Calabrese, A., Bevilacqua, M., & Navigli, R. (2020). EViLBERT: Learning task-agnostic multimodal sense embeddings. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2021-January, pp. 481–487). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2020/67
