SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

Abstract

Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task include variability across speakers, differing recording environments, and the variety of objects in images and their locations. This paper presents SBVQA 2.0, a robust multimodal neural network architecture that integrates information from both the visual and the speech domains. SBVQA 2.0 is composed of four modules: a speech encoder, an image encoder, a features fusor, and an answer generator. The speech encoder extracts semantic information from spoken questions, and the image encoder extracts visual information from images. The outputs of the two encoders are combined by the features fusor and then processed by the answer generator to predict the answer. Although SBVQA 2.0 was trained on a single-speaker dataset with clean backgrounds, we show that our selected speech encoder is robust to noise and is speaker-independent. Moreover, we demonstrate that SBVQA 2.0 can be further improved by fine-tuning in an end-to-end manner, since all of its modules are fully differentiable. We open-source our pretrained models, source code, and dataset for the research community.
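
The sketch below illustrates, in PyTorch, the four-module pipeline named in the abstract (speech encoder, image encoder, features fusor, answer generator) composed end-to-end so gradients flow through every module. All layer choices, dimensions, class names, and the answer-vocabulary size are illustrative assumptions for this sketch; they are not the authors' implementation.

# Illustrative sketch of a four-module SBVQA-style pipeline.
# Module choices and sizes are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes a spoken question (here: a log-mel spectrogram) into a vector."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)

    def forward(self, spec):                      # spec: (B, T, n_mels)
        _, h = self.rnn(spec)                     # h: (num_layers*2, B, hidden)
        return torch.cat([h[-2], h[-1]], dim=-1)  # (B, 2*hidden)

class ImageEncoder(nn.Module):
    """Encodes an RGB image into a vector with a small CNN."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.cnn(img)                      # (B, out_dim)

class FeaturesFusor(nn.Module):
    """Fuses speech and image representations (simple concat + MLP)."""
    def __init__(self, speech_dim=512, image_dim=512, fused_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim + image_dim, fused_dim), nn.ReLU())

    def forward(self, s, v):
        return self.mlp(torch.cat([s, v], dim=-1))

class AnswerGenerator(nn.Module):
    """Predicts an answer as classification over a fixed answer vocabulary."""
    def __init__(self, fused_dim=512, num_answers=3000):
        super().__init__()
        self.head = nn.Linear(fused_dim, num_answers)

    def forward(self, fused):
        return self.head(fused)                   # (B, num_answers) logits

class SBVQAPipeline(nn.Module):
    """End-to-end, fully differentiable composition of the four modules."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = SpeechEncoder()
        self.image_encoder = ImageEncoder()
        self.fusor = FeaturesFusor()
        self.answer_generator = AnswerGenerator()

    def forward(self, spec, img):
        s = self.speech_encoder(spec)
        v = self.image_encoder(img)
        return self.answer_generator(self.fusor(s, v))

if __name__ == "__main__":
    model = SBVQAPipeline()
    spec = torch.randn(2, 200, 80)    # 2 spectrograms, 200 frames, 80 mel bins
    img = torch.randn(2, 3, 224, 224) # 2 RGB images
    print(model(spec, img).shape)     # torch.Size([2, 3000])

Because every module is differentiable, a single loss on the answer logits can be backpropagated through the fusor and both encoders, which is what makes the end-to-end fine-tuning mentioned in the abstract possible.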

Cite (APA)
Alasmary, F., & Al-Ahmadi, S. (2023). SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions. IEEE Access, 11, 140967–140980. https://doi.org/10.1109/ACCESS.2023.3339537
