MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Abstract

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions; as a result, answers are either spliced into the questions or used only as classification labels. Trilinear models such as the CTI model of Do et al. (2019), on the other hand, efficiently exploit the inter-modality information among answers, questions, and images, but ignore intra-modality information. Motivated by these observations, we propose a new trilinear interaction framework, MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. We also design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
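The trilinear interaction sketched in the abstract can be pictured as self-attention over the concatenated question, image, and answer token sequences, so that attention weights span both intra-modality and inter-modality token pairs. The PyTorch snippet below is a minimal illustrative sketch of that idea only; the module name, dimensions, and layer layout (`TrilinearInteractionBlock`, `dim`, `heads`) are hypothetical simplifications and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TrilinearInteractionBlock(nn.Module):
    """Illustrative sketch (not the paper's code): question, image, and
    answer token features are concatenated along the sequence axis and
    passed through shared multi-head self-attention, so a single attention
    map covers intra-modality and inter-modality token pairs."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_tokens, v_tokens, a_tokens):
        # q_tokens: (B, Lq, dim) question token features
        # v_tokens: (B, Lv, dim) image region features
        # a_tokens: (B, La, dim) candidate-answer token features
        x = torch.cat([q_tokens, v_tokens, a_tokens], dim=1)  # joint sequence
        attn_out, _ = self.attn(x, x, x)                       # trilinear self-attention
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        # split the updated sequence back into the three modalities
        Lq, Lv = q_tokens.size(1), v_tokens.size(1)
        return x[:, :Lq], x[:, Lq:Lq + Lv], x[:, Lq + Lv:]

# Example usage with random features (batch of 2, hypothetical lengths):
block = TrilinearInteractionBlock()
q = torch.randn(2, 14, 768)
v = torch.randn(2, 36, 768)
a = torch.randn(2, 4, 768)
q_out, v_out, a_out = block(q, v, a)
```

In the two-stage workflow described above, a bilinear model would first shortlist candidate answers, and blocks like this one would then score each question–image–candidate triple; that surrounding pipeline and the masked-language-prediction pretraining are omitted from this sketch.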

Citation (APA)

Wang, J., Ji, Y., Sun, J., Yang, Y., & Sakai, T. (2021). MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2280–2292). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.196
