Multimodal Encoder-Decoder Attention Networks for Visual Question Answering


Abstract

Visual Question Answering (VQA) is a multimodal task spanning Computer Vision (CV) and Natural Language Processing (NLP); the goal is to build a high-efficiency VQA model. Learning a fine-grained, simultaneous understanding of both the visual content of images and the textual content of questions lies at the heart of VQA. In this paper, we propose novel Multimodal Encoder-Decoder Attention Networks (MEDAN). MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth, and captures rich and reasonable question and image features by associating keywords in the question with important object regions in the image. Each MEDA layer contains an Encoder module that models the self-attention of questions, and a Decoder module that models both the question-guided attention and the self-attention of images. Experimental results on the benchmark VQA-v2 dataset demonstrate that MEDAN achieves state-of-the-art VQA performance. With the Adam solver, our best single model attains 71.01% overall accuracy on the test-std set; with the AdamW solver, it attains 70.76% overall accuracy on the test-dev set. Additionally, extensive ablation studies explore the reasons for MEDAN's effectiveness.
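The abstract's description of a MEDA layer (an Encoder applying self-attention to question features, and a Decoder applying self-attention to image features followed by question-guided attention) can be sketched roughly as follows. This is a hypothetical reconstruction from the abstract alone, not the paper's implementation: the feature dimension, head count, use of `nn.MultiheadAttention`, and the ordering of the Decoder's two attention steps are all assumptions.

```python
import torch
import torch.nn as nn

class MEDALayer(nn.Module):
    """Sketch of one Multimodal Encoder-Decoder Attention (MEDA) layer.

    Assumed structure (from the abstract, not the paper's code):
      - Encoder: self-attention over question word features.
      - Decoder: self-attention over image region features, then
        question-guided attention where image regions act as queries
        and the encoded question supplies keys/values.
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.q_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_feats, v_feats):
        # Encoder: self-attention over question features
        q_enc, _ = self.q_self_attn(q_feats, q_feats, q_feats)
        # Decoder, step 1: self-attention over image region features
        v_sa, _ = self.v_self_attn(v_feats, v_feats, v_feats)
        # Decoder, step 2: question-guided attention -- image regions
        # attend to the encoded question (keys/values from the question)
        v_dec, _ = self.guided_attn(v_sa, q_enc, q_enc)
        return q_enc, v_dec

# Example: batch of 2, 14 question tokens, 36 image regions, dim 512
layer = MEDALayer()
q = torch.randn(2, 14, 512)
v = torch.randn(2, 36, 512)
q_out, v_out = layer(q, v)
print(q_out.shape, v_out.shape)
```

Cascading such layers in depth, as the abstract describes, would then amount to feeding each layer's `(q_enc, v_dec)` outputs into the next layer.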

Citation (APA)
Chen, C., Han, D., & Wang, J. (2020). Multimodal Encoder-Decoder Attention Networks for Visual Question Answering. IEEE Access, 8, 35662–35671. https://doi.org/10.1109/ACCESS.2020.2975093
