Achieving Human Parity on Visual Question Answering

Ming Yan; Haiyang Xu; Chenliang Li; Junfeng Tian; Bin Bi; Wei Wang; Xianzhe Xu; Ji Zhang; Songfang Huang; Fei Huang; Luo Si; Rong Jin

Journal Article (Open Access)

ACM Transactions on Information Systems (2023) 41(3)

DOI: 10.1145/3572833

Abstract

The Visual Question Answering (VQA) task combines image and language analysis to answer a textual question about an image. It has been a popular research topic over the last decade, with an increasing number of real-world applications. This paper introduces AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), a novel hierarchical integration of vision and language that achieves results similar to, or even slightly better than, human performance on VQA. The hierarchical framework tackles the practical problems of VQA in a cascade: (1) diverse visual semantics learning for comprehensive image content understanding; (2) enhanced multi-modal pre-training with modality-adaptive attention; and (3) knowledge-guided model integration with three specialized expert modules for the complex VQA task. Handling each type of visual question with the expertise it requires plays an important role in raising the performance of our VQA architecture to the human level. An extensive set of experiments and analyses demonstrates the effectiveness of the approach.

Cite (APA)

Yan, M., Xu, H., Li, C., Tian, J., Bi, B., Wang, W., … Jin, R. (2023). Achieving Human Parity on Visual Question Answering. ACM Transactions on Information Systems, 41(3). https://doi.org/10.1145/3572833