SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

Abstract

Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task include variability across speakers, differing recording environments, and the variety of objects in images and their locations. This paper presents SBVQA 2.0, a robust multimodal neural network architecture that integrates information from both the visual and the speech domains. SBVQA 2.0 is composed of four modules: a speech encoder, an image encoder, a features fusor, and an answer generator. The speech encoder extracts semantic information from spoken questions, and the image encoder extracts visual information from images. The outputs of the two encoders are combined by the features fusor and then processed by the answer generator to predict the answer. Although SBVQA 2.0 was trained on a single-speaker dataset with clean backgrounds, we show that our selected speech encoder is robust to noise and is speaker-independent. Moreover, we demonstrate that SBVQA 2.0 can be further improved by fine-tuning in an end-to-end manner, since all of its modules are fully differentiable. We open-source our pretrained models, source code, and dataset for the research community.
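
The sketch below illustrates, in PyTorch, the four-module pipeline named in the abstract (speech encoder, image encoder, features fusor, answer generator) composed end-to-end so gradients flow through every module. All layer choices, dimensions, class names, and the answer-vocabulary size are illustrative assumptions for this sketch; they are not the authors' implementation.

# Illustrative sketch of a four-module SBVQA-style pipeline.
# Module choices and sizes are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes a spoken question (here: a log-mel spectrogram) into a vector."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)

    def forward(self, spec):                      # spec: (B, T, n_mels)
        _, h = self.rnn(spec)                     # h: (num_layers*2, B, hidden)
        return torch.cat([h[-2], h[-1]], dim=-1)  # (B, 2*hidden)

class ImageEncoder(nn.Module):
    """Encodes an RGB image into a vector with a small CNN."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.cnn(img)                      # (B, out_dim)

class FeaturesFusor(nn.Module):
    """Fuses speech and image representations (simple concat + MLP)."""
    def __init__(self, speech_dim=512, image_dim=512, fused_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim + image_dim, fused_dim), nn.ReLU())

    def forward(self, s, v):
        return self.mlp(torch.cat([s, v], dim=-1))

class AnswerGenerator(nn.Module):
    """Predicts an answer as classification over a fixed answer vocabulary."""
    def __init__(self, fused_dim=512, num_answers=3000):
        super().__init__()
        self.head = nn.Linear(fused_dim, num_answers)

    def forward(self, fused):
        return self.head(fused)                   # (B, num_answers) logits

class SBVQAPipeline(nn.Module):
    """End-to-end, fully differentiable composition of the four modules."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = SpeechEncoder()
        self.image_encoder = ImageEncoder()
        self.fusor = FeaturesFusor()
        self.answer_generator = AnswerGenerator()

    def forward(self, spec, img):
        s = self.speech_encoder(spec)
        v = self.image_encoder(img)
        return self.answer_generator(self.fusor(s, v))

if __name__ == "__main__":
    model = SBVQAPipeline()
    spec = torch.randn(2, 200, 80)    # 2 spectrograms, 200 frames, 80 mel bins
    img = torch.randn(2, 3, 224, 224) # 2 RGB images
    print(model(spec, img).shape)     # torch.Size([2, 3000])

Because every module is differentiable, a single loss on the answer logits can be backpropagated through the fusor and both encoders, which is what makes the end-to-end fine-tuning mentioned in the abstract possible.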

Cite (APA)
Alasmary, F., & Al-Ahmadi, S. (2023). SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions. IEEE Access, 11, 140967–140980. https://doi.org/10.1109/ACCESS.2023.3339537
