Visual Question Answering (VQA) is a task in which, given an image and a natural language question about it, the goal is to produce an accurate natural language answer. In recent years, considerable work has addressed the challenges this task presents and improved model accuracy. One recently introduced concept is the attention mechanism, in which the model focuses on specific parts of the input when generating the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific regions of the image and on specific words of the question in order to generate a more precise answer. We evaluate the proposed solution on the VQA dataset and show that it outperforms state-of-the-art models.
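To make the multimodal attention idea concrete, here is a minimal NumPy sketch of soft attention applied to both modalities: a shared context vector (e.g. an LSTM state) scores image regions and question words, the scores are softmax-normalized, and each modality contributes an attention-weighted summary. The function names (`attend`), the bilinear scoring matrices `Wv`/`Wq`, and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attend(features, query, W):
    """Soft attention (illustrative): score each feature vector against the
    query with a learned bilinear form, normalize with softmax, and return
    the attention-weighted sum plus the weights.

    features: (n, d) array of region or word vectors
    query:    (d,) context vector (e.g. an LSTM hidden state)
    W:        (d, d) scoring matrix (assumed parameterization)
    """
    scores = features @ W @ query      # (n,) unnormalized relevance scores
    alpha = softmax(scores)            # attention distribution over inputs
    return alpha @ features, alpha     # attended vector, attention weights

# Toy example: 4 image regions and 6 question words, feature dim 8
rng = np.random.default_rng(0)
d = 8
img_regions = rng.standard_normal((4, d))
question_words = rng.standard_normal((6, d))
h = rng.standard_normal(d)                         # shared context state
Wv = rng.standard_normal((d, d))                   # visual scoring matrix
Wq = rng.standard_normal((d, d))                   # textual scoring matrix

v_att, a_v = attend(img_regions, h, Wv)            # attention over image
q_att, a_q = attend(question_words, h, Wq)         # attention over words
fused = np.concatenate([v_att, q_att])             # simple multimodal fusion
print(fused.shape)                                 # (16,)
```

In a trained model the scoring matrices would be learned end to end, and the fused vector would feed an answer classifier; the sketch only shows how one attention step lets the model weight image regions and question words jointly.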
Kodra, L., & Meçe, E. K. (2019). Multimodal attention for visual question answering. In Advances in Intelligent Systems and Computing (Vol. 858, pp. 783–792). Springer Verlag. https://doi.org/10.1007/978-3-030-01174-1_60