Stacked self-attention networks for visual question answering

Abstract

Given a photograph, the task of Visual Question Answering (VQA) requires joint image and language understanding to answer a question. It is challenging to effectively extract the visual representation of the image and to efficiently embed the textual question. To address these challenges, we propose a VQA model that uses stacked self-attention for visual understanding and a BERT-based question embedding model. In particular, the proposed stacked self-attention mechanism enables the model to focus not only on individual objects but also on the relations between objects. Furthermore, the BERT model is trained end-to-end to better embed the question sentences. Our model is validated on the well-known VQA v2.0 dataset and achieves state-of-the-art results.
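The sketch below illustrates, under stated assumptions, the kind of architecture the abstract describes: stacked self-attention layers over region-level image features combined with an end-to-end fine-tuned BERT question encoder. It uses PyTorch and the Hugging Face transformers library; the dimensions, layer counts, multiplicative fusion, and answer-vocabulary size are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class StackedSelfAttentionVQA(nn.Module):
    """Minimal sketch: stacked self-attention over image regions + BERT question encoder."""

    def __init__(self, visual_dim=2048, hidden_dim=768,
                 num_heads=8, num_layers=4, num_answers=3129):
        super().__init__()
        # Project region-level visual features into the shared hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Stacked self-attention layers over image regions, so the model can
        # attend to relations between objects rather than single objects only.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # BERT question encoder, fine-tuned end-to-end with the rest of the model.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Simple multiplicative fusion followed by an answer classifier
        # (an assumption; the paper may fuse the modalities differently).
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, region_feats, input_ids, attention_mask):
        # region_feats: (batch, num_regions, visual_dim) object-detector features
        v = self.visual_encoder(self.visual_proj(region_feats))
        v = v.mean(dim=1)  # pool the attended region features
        q = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output
        return self.classifier(v * q)  # answer logits over the answer vocabulary
```

Because the BERT parameters sit inside the module, a standard optimizer over `model.parameters()` fine-tunes the question encoder jointly with the visual branch, matching the end-to-end training described in the abstract.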

Citation (APA)

Sun, Q., & Fu, Y. (2019). Stacked self-attention networks for visual question answering. In ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (pp. 207–211). Association for Computing Machinery, Inc. https://doi.org/10.1145/3323873.3325044
