Given a photograph, the task of Visual Question Answering (VQA) requires joint image and language understanding to answer a question. The main challenges lie in effectively extracting visual representations of the image and efficiently embedding the textual question. To address these challenges, we propose a VQA model that combines stacked self-attention for visual understanding with a BERT-based question embedding model. In particular, the proposed stacked self-attention mechanism enables the model to focus not only on individual objects but also on the relations between objects. Furthermore, the BERT model is trained in an end-to-end manner to better embed the question sentences. Our model is validated on the well-known VQA v2.0 dataset and achieves state-of-the-art results.
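To make the described architecture concrete, below is a minimal sketch of stacked self-attention over image region features fused with a question vector. The scaled dot-product formulation, residual connections, mean pooling, element-wise fusion, and all dimensions are illustrative assumptions not specified in the abstract, and the random question vector here merely stands in for a BERT question embedding.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # One self-attention layer over image region features X (n_regions x d).
    # Each region attends to every other region, so the output can encode
    # relations between objects rather than isolated objects only.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_regions, n_regions)
    return softmax(scores) @ V                # relation-aware region features

def stacked_self_attention(X, layers):
    # Stack several self-attention layers; residual connections are an
    # assumption, deeper layers can capture higher-order object relations.
    for Wq, Wk, Wv in layers:
        X = X + self_attention(X, Wq, Wk, Wv)
    return X

# Hypothetical usage: 36 region features of dimension 512, 2 stacked layers,
# fused with a question vector by element-wise product (fusion scheme assumed).
rng = np.random.default_rng(0)
d, n_regions, n_layers = 512, 36, 2
X = rng.standard_normal((n_regions, d))
layers = [tuple(rng.standard_normal((d, d)) * 0.02 for _ in range(3))
          for _ in range(n_layers)]
v = stacked_self_attention(X, layers).mean(axis=0)  # pooled visual vector
q = rng.standard_normal(d)                          # stand-in for a BERT question embedding
fused = v * q                                       # joint representation fed to an answer classifier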
Sun, Q., & Fu, Y. (2019). Stacked self-attention networks for visual question answering. In Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (ICMR 2019) (pp. 207–211). Association for Computing Machinery. https://doi.org/10.1145/3323873.3325044