Multimodal attention for visual question answering

Abstract

Visual Question Answering (VQA) is the task of providing an accurate natural language answer, given an image and a natural language question about that image. In recent years, considerable work has been done in this area to address the challenges the task presents and to improve model accuracy. One recently introduced concept is the attention mechanism, in which the model focuses on specific parts of the input in order to generate the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image and on specific words of the question, producing a more precise answer. We evaluate our proposed solution on the VQA dataset and show that it performs favorably compared with state-of-the-art models.
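The abstract does not include implementation details, so the following is only a minimal, hypothetical sketch of the general idea: an LSTM encodes the question, soft attention weights are computed separately over image region features and question word states, and the two attended vectors are fused to predict an answer. All layer names, dimensions, and the fusion scheme are assumptions for illustration and do not reproduce the paper's actual architecture.

```python
# Illustrative sketch of multimodal (image + question) soft attention for VQA.
# Dimensions, layer names, and fusion are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, word_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # LSTM encodes question word embeddings into per-word hidden states.
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Scoring layers produce one attention logit per image region / word.
        self.img_att = nn.Linear(hidden_dim, 1)
        self.txt_att = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, img_feats, word_embs):
        # img_feats: (batch, regions, img_dim); word_embs: (batch, words, word_dim)
        word_states, _ = self.lstm(word_embs)               # (batch, words, hidden)
        img_states = torch.tanh(self.img_proj(img_feats))   # (batch, regions, hidden)

        # Visual attention: weight each image region, then pool.
        img_alpha = F.softmax(self.img_att(img_states), dim=1)
        img_ctx = (img_alpha * img_states).sum(dim=1)       # (batch, hidden)

        # Textual attention: weight each question word, then pool.
        txt_alpha = F.softmax(self.txt_att(word_states), dim=1)
        txt_ctx = (txt_alpha * word_states).sum(dim=1)      # (batch, hidden)

        # Fuse both attended representations and predict an answer class.
        return self.classifier(torch.cat([img_ctx, txt_ctx], dim=1))

# Toy usage with random features (batch of 2, 36 regions, 14 words).
model = MultimodalAttentionVQA()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 14, 300))
print(logits.shape)  # torch.Size([2, 1000])
```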

Citation (APA)

Kodra, L., & Meçe, E. K. (2019). Multimodal attention for visual question answering. In Advances in Intelligent Systems and Computing (Vol. 858, pp. 783–792). Springer Verlag. https://doi.org/10.1007/978-3-030-01174-1_60
