ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs


Abstract

We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. ERNIE-ViL aims to build detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn joint representations characterizing the alignment of detailed semantics across vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
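To make the Scene Graph Prediction idea concrete, the sketch below shows one way nodes of a scene graph parsed from a caption (objects, attributes, relationships) could be selected as masked-prediction targets. This is a minimal illustration under assumed inputs: the pre-parsed graph format, the `scene_graph_targets` helper, and the `[MASK]` token are hypothetical and not the authors' implementation, which builds these tasks on top of a scene graph parser and BERT-style masked pre-training.

```python
# Illustrative sketch only: toy construction of Scene Graph Prediction targets
# (Object / Attribute / Relationship Prediction) from a caption and a pre-parsed
# scene graph. Format and helper names are assumptions, not the paper's code.
import random

MASK = "[MASK]"

def scene_graph_targets(tokens, scene_graph, mask_prob=0.3, rng=random):
    """Mask tokens that correspond to scene-graph nodes, grouped by node type.

    tokens      : caption tokens, e.g. ["a", "brown", "dog", "chases", "a", "cat"]
    scene_graph : token indices per node type, e.g.
                  {"objects": [2, 5], "attributes": [1], "relationships": [3]}
    Returns the masked token sequence and per-task labels the model must recover.
    """
    masked = list(tokens)
    targets = {"objects": {}, "attributes": {}, "relationships": {}}
    for node_type, indices in scene_graph.items():
        for i in indices:
            if rng.random() < mask_prob:
                targets[node_type][i] = tokens[i]   # label = original token
                masked[i] = MASK                    # model predicts it back
    return masked, targets

if __name__ == "__main__":
    random.seed(0)
    tokens = ["a", "brown", "dog", "chases", "a", "cat"]
    graph = {"objects": [2, 5], "attributes": [1], "relationships": [3]}
    masked, targets = scene_graph_targets(tokens, graph, mask_prob=0.5)
    print(masked)   # ['a', '[MASK]', 'dog', '[MASK]', 'a', 'cat']
    print(targets)  # {'objects': {}, 'attributes': {1: 'brown'}, 'relationships': {3: 'chases'}}
```

In this toy setup, masking the attribute "brown" and the relationship "chases" forces the model to recover fine-grained semantics from the paired image rather than from co-occurring words alone, which is the intuition behind the paper's scene-graph-based pre-training tasks.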

Cite (APA)
Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., & Wang, H. (2021). ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 4B, pp. 3208–3216). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i4.16431
