A Survey on CLIP-Guided Vision-Language Tasks

Yu, Z.

Abstract

Multimodal learning refers to representing different modalities, such as images, text, and audio, with a unified model. In this article, we first introduce the basic approach of CLIP, a vision-language model that connects different modalities, then present models inspired by CLIP across various downstream tasks, and conclude with a summary of the prospects and limitations of multimodal learning.
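The survey itself contains no code, but the connection CLIP draws between modalities can be illustrated with a short sketch: CLIP embeds images and text prompts into a shared space and scores each pair by embedding similarity, which directly enables zero-shot classification. The snippet below is a minimal, hedged example assuming the Hugging Face `transformers` CLIP API and the public "openai/clip-vit-base-patch32" checkpoint; the placeholder image and prompt set are illustrative only and not taken from the paper.

```python
# Minimal sketch of CLIP-style zero-shot classification via a shared
# image-text embedding space. Assumes the Hugging Face `transformers`
# library and the public "openai/clip-vit-base-patch32" checkpoint;
# the image and prompts below are placeholders, not from the survey.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; use a real photo in practice
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode both modalities; CLIP scores each image-text pair by the
# similarity of their embeddings (scaled by a learned temperature).
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # probabilities over the prompts

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```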

Citation (APA)
Yu, Z. (2022). A Survey on CLIP-Guided Vision-Language Tasks. Highlights in Science, Engineering and Technology, 12, 153–159. https://doi.org/10.54097/hset.v12i.1418
