MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Abstract

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions; as a result, answers are either spliced into the questions or used only as classification labels. Trilinear models such as the CTI model of Do et al. (2019), on the other hand, efficiently exploit the inter-modality information among answers, questions, and images, but ignore intra-modality information. Motivated by these observations, we propose a new trilinear interaction framework, MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. We also design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
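The trilinear interaction sketched in the abstract can be pictured as self-attention over the concatenated question, image, and answer token sequences, so that attention weights span both intra-modality and inter-modality token pairs. The PyTorch snippet below is a minimal illustrative sketch of that idea only; the module name, dimensions, and layer layout (`TrilinearInteractionBlock`, `dim`, `heads`) are hypothetical simplifications and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TrilinearInteractionBlock(nn.Module):
    """Illustrative sketch (not the paper's code): question, image, and
    answer token features are concatenated along the sequence axis and
    passed through shared multi-head self-attention, so a single attention
    map covers intra-modality and inter-modality token pairs."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_tokens, v_tokens, a_tokens):
        # q_tokens: (B, Lq, dim) question token features
        # v_tokens: (B, Lv, dim) image region features
        # a_tokens: (B, La, dim) candidate-answer token features
        x = torch.cat([q_tokens, v_tokens, a_tokens], dim=1)  # joint sequence
        attn_out, _ = self.attn(x, x, x)                       # trilinear self-attention
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        # split the updated sequence back into the three modalities
        Lq, Lv = q_tokens.size(1), v_tokens.size(1)
        return x[:, :Lq], x[:, Lq:Lq + Lv], x[:, Lq + Lv:]

# Example usage with random features (batch of 2, hypothetical lengths):
block = TrilinearInteractionBlock()
q = torch.randn(2, 14, 768)
v = torch.randn(2, 36, 768)
a = torch.randn(2, 4, 768)
q_out, v_out, a_out = block(q, v, a)
```

In the two-stage workflow described above, a bilinear model would first shortlist candidate answers, and blocks like this one would then score each question–image–candidate triple; that surrounding pipeline and the masked-language-prediction pretraining are omitted from this sketch.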

Citation (APA)

Wang, J., Ji, Y., Sun, J., Yang, Y., & Sakai, T. (2021). MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2280–2292). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.196
