Visual Question Answering (VQA) is a task in which, given an image and a natural language question about it, the goal is to produce an accurate natural language answer. In recent years, considerable work has addressed the challenges this task presents and improved model accuracy. One recently introduced concept is the attention mechanism, in which the model focuses on specific parts of the input when generating the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific regions of the image and on specific words of the question in order to generate a more precise answer. We evaluate the proposed solution on the VQA dataset and show that it outperforms state-of-the-art models.
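To make the multimodal attention idea concrete, here is a minimal NumPy sketch of soft attention applied to both modalities: a shared context vector (e.g. an LSTM state) scores image regions and question words, the scores are softmax-normalized, and each modality contributes an attention-weighted summary. The function names (`attend`), the bilinear scoring matrices `Wv`/`Wq`, and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attend(features, query, W):
    """Soft attention (illustrative): score each feature vector against the
    query with a learned bilinear form, normalize with softmax, and return
    the attention-weighted sum plus the weights.

    features: (n, d) array of region or word vectors
    query:    (d,) context vector (e.g. an LSTM hidden state)
    W:        (d, d) scoring matrix (assumed parameterization)
    """
    scores = features @ W @ query      # (n,) unnormalized relevance scores
    alpha = softmax(scores)            # attention distribution over inputs
    return alpha @ features, alpha     # attended vector, attention weights

# Toy example: 4 image regions and 6 question words, feature dim 8
rng = np.random.default_rng(0)
d = 8
img_regions = rng.standard_normal((4, d))
question_words = rng.standard_normal((6, d))
h = rng.standard_normal(d)                         # shared context state
Wv = rng.standard_normal((d, d))                   # visual scoring matrix
Wq = rng.standard_normal((d, d))                   # textual scoring matrix

v_att, a_v = attend(img_regions, h, Wv)            # attention over image
q_att, a_q = attend(question_words, h, Wq)         # attention over words
fused = np.concatenate([v_att, q_att])             # simple multimodal fusion
print(fused.shape)                                 # (16,)
```

In a trained model the scoring matrices would be learned end to end, and the fused vector would feed an answer classifier; the sketch only shows how one attention step lets the model weight image regions and question words jointly.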
Kodra, L., & Meçe, E. K. (2019). Multimodal attention for visual question answering. In Advances in Intelligent Systems and Computing (Vol. 858, pp. 783–792). Springer Verlag. https://doi.org/10.1007/978-3-030-01174-1_60