Object-less Vision-language Model on Visual Question Classification for Blind People

Abstract

Despite the long-standing presence of question types in Visual Question Answering datasets, Visual Question Classification has not received enough research attention. Unlike general text classification, a visual question requires understanding visual and textual features simultaneously. Beyond the novelty of Visual Question Classification itself, the most important and practical goal we concentrate on is addressing the weakness of Object Detection on object-less images. We therefore propose an Object-less Visual Question Classification model, OL-LXMERT, which generates virtual objects to replace the dependence on Object Detection in previous Vision-Language systems. Our architecture is effective and powerful enough to digest both local and global image features when modeling the relationship between the two modalities. In experiments on our modified VizWiz-VQC 2020 dataset of visual questions asked by blind people, OL-LXMERT achieves promising results on this new multi-modal task. Furthermore, detailed ablation studies show the strength and potential of our model in comparison to competitive approaches.
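The abstract does not specify the exact architecture, but the core idea of replacing object-detector regions with "virtual objects" can be illustrated with a minimal sketch: grid features from a CNN backbone stand in for detected-object features and are fed, together with the question tokens, into an LXMERT-style cross-modal encoder whose pooled output drives a question-type classifier. The class count, backbone, grid layout, and classifier head below are assumptions for illustration, not the paper's reported configuration.

```python
# Hypothetical sketch of an object-less visual question classifier.
# Grid cells of a CNN feature map act as "virtual objects" in place of
# object-detection regions; LXMERT fuses them with the question text.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import LxmertTokenizer, LxmertModel


class ObjectLessVQC(nn.Module):
    def __init__(self, num_question_types: int = 4, grid: int = 7):
        super().__init__()
        # Global CNN backbone: keep everything up to the last conv block,
        # so a 224x224 image yields a (2048, grid, grid) feature map.
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])

        self.lxmert = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
        self.classifier = nn.Linear(self.lxmert.config.hidden_size,
                                    num_question_types)

        # Fixed normalized (x1, y1, x2, y2) boxes for each grid cell; these
        # play the role of object positions for the virtual objects.
        ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                indexing="ij")
        boxes = torch.stack([xs, ys, xs + 1, ys + 1], dim=-1).float() / grid
        self.register_buffer("visual_pos", boxes.view(1, grid * grid, 4))

    def forward(self, images, input_ids, attention_mask):
        b = images.size(0)
        feats = self.cnn(images)                  # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048) virtual objects
        out = self.lxmert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=feats,
            visual_pos=self.visual_pos.expand(b, -1, -1),
        )
        # Cross-modal pooled representation -> question-type logits.
        return self.classifier(out.pooled_output)


if __name__ == "__main__":
    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = ObjectLessVQC()
    enc = tokenizer(["what color is this shirt?"], return_tensors="pt",
                    padding=True)
    logits = model(torch.randn(1, 3, 224, 224), enc.input_ids,
                   enc.attention_mask)
    print(logits.shape)  # torch.Size([1, 4])
```

The design choice mirrored here is that the grid cells supply the local features while the shared CNN backbone captures global context, so the model never depends on an object detector finding anything in the image.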

Citation (APA)

Le, T., Pho, K., Bui, T., Nguyen, H. T., & Le Nguyen, M. (2022). Object-less Vision-language Model on Visual Question Classification for Blind People. In International Conference on Agents and Artificial Intelligence (Vol. 3, pp. 180–187). Science and Technology Publications, Lda. https://doi.org/10.5220/0010797400003116
