Fusion or Defusion? Flexible Vision-and-Language Pre-Training

Abstract

Existing approaches in the vision-and-language pre-training (VLP) paradigm mainly deploy either fusion-based encoders or dual encoders, and thus fail to achieve both effectiveness and efficiency on downstream multimodal tasks. In this paper, we build a flexible VLP model by incorporating cross-modal fusion into a dual-encoder architecture, where the introduced fusion modules can be easily decoupled from the dual encoder, switching the model to a fusion-free one. To better absorb cross-modal features from the fusion modules, we design a cross-modal knowledge transfer strategy along with other comprehensive pre-training tasks to guide the training process, which further strengthens both fusion-based and fusion-free representation learning. Extensive experiments on various downstream vision-language tasks show that our proposed model achieves both effectiveness and efficiency, demonstrating superior performance compared with other strong VLP models.
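To make the architectural idea concrete, the following is a minimal sketch of a dual encoder with detachable fusion modules. It is not the authors' implementation: the module names, feature dimensions, and the simple cross-attention fusion are assumptions used only to illustrate how fusion can be switched on for fusion-based encoding and skipped for fusion-free (dual-encoder) inference.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative fusion module: text features attend to image features.
    Hypothetical design; the paper's actual fusion blocks may differ."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)


class FlexibleVLP(nn.Module):
    """Dual encoder with a decoupled fusion module.
    With use_fusion=True the model produces fusion-based representations;
    with use_fusion=False it degrades to a plain dual encoder."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for real backbones (e.g. a ViT image encoder and a BERT text encoder).
        self.image_encoder = nn.Sequential(nn.Linear(512, dim), nn.GELU())
        self.text_encoder = nn.Sequential(nn.Linear(300, dim), nn.GELU())
        self.fusion = CrossModalFusion(dim)

    def forward(self, image_tokens, text_tokens, use_fusion: bool = True):
        img = self.image_encoder(image_tokens)   # (B, N_img, dim)
        txt = self.text_encoder(text_tokens)     # (B, N_txt, dim)
        if use_fusion:
            txt = self.fusion(txt, img)          # cross-modal (fusion-based) path
        # Pool to per-example embeddings for contrastive / matching objectives.
        return img.mean(dim=1), txt.mean(dim=1)


model = FlexibleVLP()
# Fusion-free mode: fast, independent encoding of each modality (e.g. for retrieval).
img_emb, txt_emb = model(torch.randn(2, 49, 512), torch.randn(2, 16, 300), use_fusion=False)
```

In this sketch, the cross-modal knowledge transfer described in the abstract could be realized as an additional training loss that pushes the fusion-free text embedding toward its fusion-based counterpart, though the exact formulation is specific to the paper.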

Citation (APA)
Sun, R., Li, Z., Ding, Y., Wang, Q., Wang, J., Zheng, H. T., … Xian, Y. (2023). Fusion or Defusion? Flexible Vision-and-Language Pre-Training. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 5105–5119). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.316
