A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models


Abstract

Most research on multimodal open-domain dialogue agents has focused on pretraining and multi-task learning using additional rich datasets beyond a given target dataset. However, methods for exploiting these additional datasets can be quite limited in real-world settings, creating a need for more efficient methods for constructing agents based solely on the target dataset. To address these issues, we present a new learning strategy called vision-language warm-up tasks for multimodal dialogue models (VLAW-MDM). This strategy does not require large pretraining or multi-task datasets but instead relies solely on learning from the target data. Moreover, our proposed approach automatically generates captions for images and incorporates them into the model's input to improve the contextualization of visual information. Using this novel approach, we empirically demonstrate that our learning strategy is effective for limited data and relatively small models. The results show that our method achieves comparable, and in some cases superior, performance compared to existing state-of-the-art models on various evaluation metrics. The code is available at https://github.com/BeneciaLee/VLAW-MDM.
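To illustrate the caption-as-input idea described in the abstract, the following minimal Python sketch generates a caption for an image with an off-the-shelf captioning model and prepends it to the dialogue context as plain text. The captioning model, input format, and helper function here are illustrative assumptions rather than the authors' actual pipeline; see the linked repository for the real implementation.

# Minimal sketch of the caption-as-input idea (assumptions, not the authors' exact code).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf image captioner; any captioning model could be substituted.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def build_dialogue_input(image_path: str, dialogue_history: list[str]) -> str:
    """Generate a caption for the image and prepend it to the dialogue context."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)

    # Concatenate the caption and the dialogue turns into a single text input,
    # so the dialogue model sees a textual description of the visual context.
    return " ".join([f"caption: {caption}"] + dialogue_history)

# Hypothetical usage:
# text_input = build_dialogue_input("photo.jpg", ["A: Where was this taken?", "B: At the beach."])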

Cite (APA)

Lee, J., Park, S., Park, S. H., Kim, H., & Kim, H. (2023). A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 2789–2799). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.167
