Learning to Select Question-Relevant Relations for Visual Question Answering

Citations: 3 · Mendeley readers: 49

Abstract

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate the outputs of multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses the question embedding and the joint embeddings of the encoders to obtain dynamic attention weights that depend on the type of question. Using the learnable attention weights, the proposed model can efficiently select the visual relation features needed for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns higher weights to relations that are more relevant to the question.
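The fusion idea described above can be illustrated with a minimal sketch: attention weights over the relation types (e.g., semantic and spatial) are computed from the question embedding together with each encoder's joint embedding, and the encoder outputs are then combined with those weights instead of being concatenated. The module and argument names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedRelationFusion(nn.Module):
    """Sketch of question-guided fusion over multiple relation encoders."""

    def __init__(self, q_dim: int, v_dim: int):
        super().__init__()
        # Scores one (question, relation-embedding) pair per relation type.
        self.score = nn.Linear(q_dim + v_dim, 1)

    def forward(self, q_emb: torch.Tensor, rel_embs: list) -> tuple:
        # q_emb:    (batch, q_dim)            question embedding
        # rel_embs: list of (batch, v_dim)    joint embedding from each relation encoder
        stacked = torch.stack(rel_embs, dim=1)                # (batch, R, v_dim)
        q_rep = q_emb.unsqueeze(1).expand(-1, stacked.size(1), -1)
        logits = self.score(torch.cat([q_rep, stacked], dim=-1)).squeeze(-1)  # (batch, R)
        weights = F.softmax(logits, dim=-1)                   # question-dependent attention weights
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (batch, v_dim) weighted fusion
        return fused, weights
```

In this sketch, the returned weights can be inspected directly, which mirrors the paper's visualization of which relation type the model attends to for a given question.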

Citation (APA)

Lee, J., Lee, H., Lee, H., & Jung, K. (2021). Learning to Select Question-Relevant Relations for Visual Question Answering. In Multimodal Artificial Intelligence, MAI Workshop 2021 - Proceedings of the 3rd Workshop (pp. 87–96). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.maiworkshop-1.13
