Learning to Select Question-Relevant Relations for Visual Question Answering

Citations: 3 · Mendeley readers: 49

Abstract

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate the outputs of multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses the question embedding and the joint embeddings of the encoders to obtain dynamic attention weights that depend on the type of question. Using the learnable attention weights, the proposed model can efficiently select the visual relation features needed for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns higher weights to relations that are more relevant to the question.
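The fusion idea described above can be illustrated with a minimal sketch: attention weights over the relation types (e.g., semantic and spatial) are computed from the question embedding together with each encoder's joint embedding, and the encoder outputs are then combined with those weights instead of being concatenated. The module and argument names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedRelationFusion(nn.Module):
    """Sketch of question-guided fusion over multiple relation encoders."""

    def __init__(self, q_dim: int, v_dim: int):
        super().__init__()
        # Scores one (question, relation-embedding) pair per relation type.
        self.score = nn.Linear(q_dim + v_dim, 1)

    def forward(self, q_emb: torch.Tensor, rel_embs: list) -> tuple:
        # q_emb:    (batch, q_dim)            question embedding
        # rel_embs: list of (batch, v_dim)    joint embedding from each relation encoder
        stacked = torch.stack(rel_embs, dim=1)                # (batch, R, v_dim)
        q_rep = q_emb.unsqueeze(1).expand(-1, stacked.size(1), -1)
        logits = self.score(torch.cat([q_rep, stacked], dim=-1)).squeeze(-1)  # (batch, R)
        weights = F.softmax(logits, dim=-1)                   # question-dependent attention weights
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (batch, v_dim) weighted fusion
        return fused, weights
```

In this sketch, the returned weights can be inspected directly, which mirrors the paper's visualization of which relation type the model attends to for a given question.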

Citation (APA)

Lee, J., Lee, H., Lee, H., & Jung, K. (2021). Learning to Select Question-Relevant Relations for Visual Question Answering. In Multimodal Artificial Intelligence, MAI Workshop 2021 - Proceedings of the 3rd Workshop (pp. 87–96). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.maiworkshop-1.13
