Visual Textbook Network: Watch Carefully Before Answering Visual Questions

Abstract

Recent deep neural networks have achieved promising results on Visual Question Answering (VQA) tasks. However, many works have shown that high accuracy does not guarantee that a VQA system correctly understands the contents of images and questions, which is what we really care about. Attention-based models can locate the regions related to answers and may thus demonstrate a promising understanding of the image and question. However, the key components for generating correct locations, i.e., visual-semantic alignment and semantic reasoning, remain obscure and invisible. To deal with this problem, we introduce a two-stage model, the Visual Textbook Network (VTN), which consists of two modules that together produce more reasonable answers. Specifically, in the first stage, a textbook module watches the image carefully by performing a novel task named sentence reconstruction, which encodes a word into a visual region feature and then decodes that visual feature back into the input word. This procedure forces VTN to learn visual-semantic alignments without much concern for question answering. This stage is like studying from a textbook, where people concentrate on the knowledge in the book and pay little attention to the test. In the second stage, we propose a simple network as the exam module, which uses both the visual features generated by the first module and the question to predict the answer. To validate the effectiveness of our method, we conduct evaluations on the Visual7W dataset and report quantitative and qualitative question-answering results. We also perform ablation studies to further confirm the effectiveness of the individual textbook and exam modules.
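To make the two-stage design concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is not the authors' implementation: all module names, dimensions (embed_dim, region_dim, hidden_dim), the attention form, and the fusion/answer head are illustrative assumptions. The textbook module attends each word over image region features and reconstructs the word from the attended visual feature, so the reconstruction loss drives visual-semantic alignment; the exam module then fuses those visual features with an encoded question to predict an answer.

```python
# Illustrative sketch only -- NOT the paper's code. All architectural
# details below are assumptions made for the sake of the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextbookModule(nn.Module):
    """Stage 1: encode each word to an attended visual region feature,
    then decode that feature back to the word (sentence reconstruction)."""
    def __init__(self, vocab_size, embed_dim=300, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_proj = nn.Linear(embed_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.decoder = nn.Linear(region_dim, vocab_size)  # visual feature -> word logits

    def forward(self, word_ids, regions):
        # word_ids: (B, T) sentence tokens; regions: (B, R, region_dim) image regions
        w = self.word_proj(self.embed(word_ids))             # (B, T, H)
        r = self.region_proj(regions)                        # (B, R, H)
        attn = torch.softmax(w @ r.transpose(1, 2), dim=-1)  # word-to-region attention
        visual = attn @ regions                              # (B, T, region_dim)
        return self.decoder(visual), visual                  # word logits, aligned features

    def reconstruction_loss(self, word_ids, regions):
        # Decoding the attended visual feature back to the input word forces
        # the attention to learn visual-semantic alignments.
        logits, _ = self.forward(word_ids, regions)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               word_ids.reshape(-1))

class ExamModule(nn.Module):
    """Stage 2: fuse the textbook module's visual features with the
    question to predict an answer (a generic classification head here)."""
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 region_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.v_proj = nn.Linear(region_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_ids, visual):
        # question_ids: (B, Tq); visual: (B, T, region_dim) from stage 1
        _, (h, _) = self.q_rnn(self.embed(question_ids))
        q = h.squeeze(0)                      # question summary (B, H)
        v = self.v_proj(visual.mean(dim=1))   # pooled visual features (B, H)
        return self.classifier(q * v)         # element-wise fusion -> answer logits
```

Note that Visual7W is a multiple-choice benchmark, so the actual exam module presumably scores candidate answers rather than classifying over a fixed answer vocabulary; the classification head above is a common stand-in, not the paper's design.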

Citation (APA)

Gao, D., Wang, R., Shan, S., & Chen, X. (2017). Visual textbook network: Watch carefully before answering visual questions. In British Machine Vision Conference 2017, BMVC 2017. BMVA Press. https://doi.org/10.5244/c.31.131
