Semantic reanalysis of scene words in visual question answering

Abstract

Visual Question Answering (VQA) is a joint vision-and-language task that aims to answer questions about given images. Correctly handling questions that aggregate information across multiple photo albums remains a key challenge in VQA: when a question spans several albums, the model must understand both the album images and the corresponding question. Under the influence of multiple albums, a scene word in the question can lead the model to infer the wrong scene and output the wrong answer, degrading VQA performance. To address this problem, this paper proposes a new image-sentence similarity matching model that produces a correct image representation by learning semantic concepts. Because a scene word is not an entity, the information the model extracts for it may be incorrect; we therefore reanalyse the question from a different perspective and derive the answer from the similarity between the question and the visual-text. We evaluated our model on the MemexQA dataset. The experimental results show that our model not only produces meaningful text sentences that justify the correctness of its answers, but also improves accuracy by nearly 10%.
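The abstract does not spell out the matching step, but the core idea it describes is scoring a question against candidate visual-text representations and selecting the best match. Below is a minimal sketch of that step in Python, assuming the question and the visual-texts have already been encoded into fixed-size embedding vectors by some learned encoders; the function names and the cosine-similarity choice are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors;
    # the small epsilon guards against zero-norm vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_question_to_visual_texts(question_emb, visual_text_embs):
    # Hypothetical helper: return the index of the visual-text whose
    # embedding is most similar to the question embedding, plus all scores.
    scores = [cosine_similarity(question_emb, v) for v in visual_text_embs]
    return int(np.argmax(scores)), scores

# Toy usage with random stand-in embeddings; a real system would use
# learned question and visual-text encoders instead.
rng = np.random.default_rng(0)
question = rng.normal(size=256)
candidates = [rng.normal(size=256) for _ in range(4)]
best, scores = match_question_to_visual_texts(question, candidates)
print(best, [round(s, 3) for s in scores])

The answer associated with the highest-scoring visual-text would then be returned, which is consistent with the abstract's claim that the matched text sentence itself serves as evidence for the answer.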

Citation (APA)

Jiang, S., Ma, M., Wang, J., Liang, J., Liu, K., Sun, Y., … Jin, G. (2019). Semantic reanalysis of scene words in visual question answering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11857 LNCS, pp. 468–479). Springer. https://doi.org/10.1007/978-3-030-31654-9_40
