Object sequences: Encoding categorical and spatial information for a yes/no visual question answering task

Abstract

The task of visual question answering (VQA) has gained wide popularity in recent times. Effectively solving the VQA task requires an understanding of both the visual content in the image and the language information associated with the text-based question. In this study, the authors propose a novel method of encoding the visual information (categorical and spatial object information) of all the objects present in the image into a sequential format, which they call an object sequence. These object sequences can then be suitably processed by a neural network. They experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and the language-based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture using object sequences on the Oracle task of the GuessWhat?! dataset (a yes/no VQA task) and benchmark it against the baseline.
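
Below is a minimal sketch of what such an object-sequence encoder and yes/no answerer might look like. The abstract does not specify the architecture, feature dimensions, or fusion strategy, so everything here is an assumption for illustration: each object is represented by a learned category embedding plus 8 normalised bounding-box coordinates, the object sequence and the question are each encoded with an LSTM, and the two codes are fused by concatenation before a small classifier. Names such as ObjectSequenceOracle and spatial_features are hypothetical.

```python
# Illustrative sketch only: the paper's exact architecture, feature sizes,
# and fusion method are not given in the abstract; these are assumptions.
import torch
import torch.nn as nn


def spatial_features(box, img_w, img_h):
    """Encode a bounding box (x, y, w, h) as 8 normalised values in [-1, 1]:
    corner coordinates, centre, width and height."""
    x, y, w, h = box
    x_min, y_min = x / img_w * 2 - 1, y / img_h * 2 - 1
    x_max, y_max = (x + w) / img_w * 2 - 1, (y + h) / img_h * 2 - 1
    x_c, y_c = (x_min + x_max) / 2, (y_min + y_max) / 2
    return [x_min, y_min, x_max, y_max, x_c, y_c, x_max - x_min, y_max - y_min]


class ObjectSequenceOracle(nn.Module):
    """Minimal yes/no answerer: an LSTM over the object sequence
    (category embedding + spatial features per object), an LSTM over the
    question tokens, concatenation of the two codes, and an MLP classifier."""

    def __init__(self, n_categories, vocab_size, cat_dim=64, spatial_dim=8,
                 word_dim=64, hidden=128):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, cat_dim)
        self.obj_rnn = nn.LSTM(cat_dim + spatial_dim, hidden, batch_first=True)
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.q_rnn = nn.LSTM(word_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, obj_cats, obj_spatial, question):
        # obj_cats: (B, N) category ids, obj_spatial: (B, N, 8), question: (B, T)
        obj_seq = torch.cat([self.cat_emb(obj_cats), obj_spatial], dim=-1)
        _, (obj_code, _) = self.obj_rnn(obj_seq)        # final hidden state
        _, (q_code, _) = self.q_rnn(self.word_emb(question))
        joint = torch.cat([obj_code[-1], q_code[-1]], dim=-1)
        return self.classifier(joint)                   # logits for yes/no


# Dummy usage: 2 images, 5 objects each, a 7-token question.
model = ObjectSequenceOracle(n_categories=90, vocab_size=5000)
cats = torch.randint(0, 90, (2, 5))
spatial = torch.stack([
    torch.tensor([spatial_features((10, 20, 50, 80), 640, 480) for _ in range(5)]),
    torch.tensor([spatial_features((100, 40, 30, 60), 640, 480) for _ in range(5)]),
])
question = torch.randint(0, 5000, (2, 7))
print(model(cats, spatial, question).shape)  # torch.Size([2, 2])
```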

Cite

APA

Garg, S., & Srivastava, R. (2018). Object sequences: Encoding categorical and spatial information for a yes/no visual question answering task. IET Computer Vision, 12(8), 1141–1150. https://doi.org/10.1049/iet-cvi.2018.5226
