Multimodal Prompt Retrieval for Generative Visual Question Answering


Abstract

Recent years have witnessed impressive results of pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite the recent advances in VQA, existing methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, leading to easy overfitting on low-resource domains with limited labeled data (e.g., medicine) and poor generalization under domain shift to another dataset. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.
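The abstract describes MPR at a high level: retrieve labeled examples similar to the current image-question pair and fold them into the prompt of a generative decoder that writes the answer as free text. The sketch below illustrates only that retrieval-then-prompt pattern; the encoders, the toy datastore, and the prompt template are hypothetical stand-ins (the paper's actual system uses pre-trained vision-language models), so this is an illustrative sketch rather than the authors' implementation.

# Illustrative sketch of multimodal prompt retrieval (MPR).
# embed_image / embed_question are hypothetical encoders, replaced here with
# simple hashing and random projections so the example runs end to end.
import numpy as np

DIM = 64

def embed_question(text: str) -> np.ndarray:
    """Hypothetical text encoder: hash tokens into a fixed-size vector."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec[hash(tok) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_image(image_id: str) -> np.ndarray:
    """Hypothetical image encoder: stands in for a frozen vision backbone."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)

# Toy datastore of labeled (image, question, answer) triples,
# e.g. drawn from a medical VQA training set.
datastore = [
    {"image": "xray_001", "question": "is there a fracture in the left rib?", "answer": "no"},
    {"image": "xray_002", "question": "what abnormality is seen in the lung?", "answer": "pneumonia"},
    {"image": "mri_003",  "question": "is the lesion benign or malignant?",   "answer": "benign"},
]
keys = np.stack([
    np.concatenate([embed_image(d["image"]), embed_question(d["question"])])
    for d in datastore
])

def retrieve_prompts(image_id: str, question: str, k: int = 2) -> list[dict]:
    """Return the k most similar training examples by multimodal cosine similarity."""
    query = np.concatenate([embed_image(image_id), embed_question(question)])
    sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
    return [datastore[i] for i in np.argsort(-sims)[:k]]

def build_prompt(image_id: str, question: str) -> str:
    """Concatenate retrieved question-answer pairs with the target question.
    The resulting prompt would be fed, together with the image features,
    to a generative decoder that produces a free-text answer."""
    retrieved = retrieve_prompts(image_id, question)
    context = " ".join(f"question: {d['question']} answer: {d['answer']}" for d in retrieved)
    return f"{context} question: {question} answer:"

print(build_prompt("xray_009", "is there a fracture visible?"))

Because the answer is generated as free text rather than chosen from a fixed label set, the same retrieval-and-prompting scheme can be pointed at a new dataset's training examples without retraining, which is what enables the zero-shot and few-shot adaptation results reported in the abstract.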

Cite (APA)

Ossowski, T., & Hu, J. (2023). Multimodal Prompt Retrieval for Generative Visual Question Answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 2518–2535). Association for Computational Linguistics (ACL).
