Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Abstract

Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training, and such data are costly to collect and cumbersome to curate. Inspired by unsupervised machine translation, we investigate whether a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct “mask-and-predict” pre-training on text-only and image-only corpora, and we introduce the object tags detected by an object recognition model as anchor points to bridge the two modalities. We find that this simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.
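To make the “mask-and-predict with detected tags as anchors” idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: on image-only data, detected object tags stand in for a caption, a fraction of the tag tokens is masked, and a small joint transformer predicts them from the remaining tags and the region features. All names (TinyVLEncoder, mask_and_predict_step), dimensions, and token ids are illustrative assumptions; text-only batches would pass real sentences and no regions through the same model.

```python
# Minimal sketch of mask-and-predict pre-training without aligned captions,
# assuming detected object tags act as the text stream for image-only data.
# Module names, vocab size, and feature dimensions are hypothetical.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, MASK_ID = 30522, 256, 103  # assumed BERT-style vocab/[MASK] id

class TinyVLEncoder(nn.Module):
    """Joint transformer over tag/word embeddings and detector region features."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.img_proj = nn.Linear(2048, HIDDEN)  # project detector features to HIDDEN
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, token_ids, region_feats):
        # Concatenate text/tag embeddings with projected region features.
        x = torch.cat([self.tok_emb(token_ids), self.img_proj(region_feats)], dim=1)
        h = self.encoder(x)
        # Predict vocabulary logits only at the text/tag positions.
        return self.mlm_head(h[:, :token_ids.size(1)])

def mask_and_predict_step(model, token_ids, region_feats, p=0.15):
    """Randomly mask tag/word tokens and predict them from the joint context."""
    labels = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < p
    labels[~mask] = -100                              # ignore unmasked positions
    masked_ids = token_ids.masked_fill(mask, MASK_ID)
    logits = model(masked_ids, region_feats)
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1), ignore_index=-100)

# Image-only batch: ids of detected object tags (e.g. "dog", "frisbee") play the
# role of a caption; region features come from a Faster R-CNN-style detector.
model = TinyVLEncoder()
tag_ids = torch.randint(1000, VOCAB_SIZE, (2, 8))     # 8 detected tags per image
regions = torch.randn(2, 36, 2048)                    # 36 region features per image
loss = mask_and_predict_step(model, tag_ids, regions)
loss.backward()
```

In this sketch the detected tags are the only signal shared between the image-only and text-only streams, which is what lets them serve as anchor points between modalities; the paper's actual model and training objectives may differ in detail.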

Citation (APA)

Li, L. H., You, H., Wang, Z., Zareian, A., Chang, S. F., & Chang, K. W. (2021). Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021) (pp. 5339–5350). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.420
