Multimodal learning refers to representing different modalities, such as images, text, and audio, with a unified model. In this article, we first introduce the basic approach of CLIP, a vision-language model capable of connecting different modalities, and then present models inspired by CLIP for various downstream tasks. We conclude with a summary of the prospects and limitations of multimodal learning.
Yu, Z. (2022). A Survey on CLIP-Guided Vision-Language Tasks. Highlights in Science, Engineering and Technology, 12, 153–159. https://doi.org/10.54097/hset.v12i.1418