VLIS: Unimodal Language Models Guide Multimodal Language Generation

Abstract

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. It extracts the pointwise mutual information of each image and text from a vision-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OKVQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
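The abstract describes the core computation concretely enough to sketch: the vision-language model (VLM) supplies a pointwise mutual information (PMI) term between the image and each candidate token, and that term reweights the text-only model's next-token likelihood. Below is a minimal Python sketch of this reweighting at a single decoding step. It assumes log-probabilities from both models are already in hand; the function name vlis_scores, the alpha strength knob, and the use of the VLM's image-free distribution as the marginal are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def vlis_scores(lm_logprobs, vlm_cond_logprobs, vlm_marg_logprobs, alpha=1.0):
    """Reweight text-only LM token log-probabilities with a PMI-based
    importance weight from a vision-language model (sketch, not the
    authors' code).

    lm_logprobs:       log p_text(token | context)        -- text-only LM
    vlm_cond_logprobs: log p_vlm(token | image, context)  -- VLM with image
    vlm_marg_logprobs: log p_vlm(token | context)         -- VLM without the
                       image, taken here as a stand-in for the marginal
    alpha:             strength of the visual weight (hypothetical knob)
    """
    # PMI between the image and each candidate token, in log space:
    # PMI = log p(token | image, ctx) - log p(token | ctx)
    pmi = vlm_cond_logprobs - vlm_marg_logprobs
    # Importance-sampling view: p_text(token) * exp(PMI)^alpha, in log space.
    scores = lm_logprobs + alpha * pmi
    # Renormalize into a proper next-token distribution.
    return scores - np.logaddexp.reduce(scores)

# Toy usage over a 5-token vocabulary with made-up log-probabilities.
rng = np.random.default_rng(0)
lm = np.log(rng.dirichlet(np.ones(5)))
vlm_cond = np.log(rng.dirichlet(np.ones(5)))
vlm_marg = np.log(rng.dirichlet(np.ones(5)))
combined = vlis_scores(lm, vlm_cond, vlm_marg, alpha=1.0)
print(np.exp(combined))  # blended next-token distribution, sums to 1
```

Because the combination is a product of the text-only likelihood and an exponentiated PMI weight, no gradient updates are needed: both models run in inference mode and the blend happens purely at decoding time, which is what "without further training" in the abstract refers to.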

Cite

Citation style: APA

Chung, J., & Yu, Y. (2023). VLIS: Unimodal Language Models Guide Multimodal Language Generation. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 700–721). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.46
