Abstract
In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answer is either spliced into the question or used only as a label for classification. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently exploit the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms that capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
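To make the trilinear idea concrete, the sketch below shows one way a transformer block could combine intra-modality self-attention with inter-modality cross-attention over image, question, and answer features. This is a minimal illustration in PyTorch, not the authors' architecture: the class name `TrilinearBlock`, the dimensions, and the choice of `nn.MultiheadAttention` are assumptions made for the example.

```python
import torch
import torch.nn as nn

class TrilinearBlock(nn.Module):
    """Illustrative block: self-attention within each modality (intra-modality),
    then cross-attention from each modality to the other two (inter-modality)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # one self-attention and one cross-attention layer per modality
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, image, question, answer):
        feats = [image, question, answer]  # each: (batch, seq_len, dim)
        # intra-modality: each modality attends to itself (residual connection)
        intra = [self.self_attn[i](f, f, f)[0] + f for i, f in enumerate(feats)]
        out = []
        for i, f in enumerate(intra):
            # inter-modality: attend to the concatenation of the other two modalities
            others = torch.cat([intra[j] for j in range(3) if j != i], dim=1)
            attended, _ = self.cross_attn[i](f, others, others)
            out.append(self.norm[i](f + attended))
        return out  # updated image, question, answer representations

# toy usage with random features (batch=2, hidden dim=512)
img = torch.randn(2, 36, 512)   # e.g. 36 image region features
qst = torch.randn(2, 14, 512)   # question token embeddings
ans = torch.randn(2, 4, 512)    # candidate-answer token embeddings
i_out, q_out, a_out = TrilinearBlock()(img, qst, ans)
```

In this reading, the intra-modality step supplies the information that purely trilinear interactions such as CTI ignore, while the cross-attention step models the three-way interaction among answers, questions, and images; the actual MIRTT layers and pretraining objective are described in the paper itself.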
Citation
Wang, J., Ji, Y., Sun, J., Yang, Y., & Sakai, T. (2021). MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2280–2292). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.196