A Survey on CLIP-Guided Vision-Language Tasks

Zhuoran Yu

Journal Article, Open Access

Highlights in Science, Engineering and Technology (2022), 12, 153–159

DOI: 10.54097/hset.v12i.1418

Abstract

Multimodal learning refers to representing different modalities, such as images, text, and audio, with a unified model. In this article, we first introduce the basic approach of CLIP, a vision-language model that connects different modalities, and then present models inspired by CLIP for various downstream tasks. We conclude with a summary of the prospects and limitations of multimodal learning.

Cite

Yu, Z. (2022). A Survey on CLIP-Guided Vision-Language Tasks. Highlights in Science, Engineering and Technology, 12, 153–159. https://doi.org/10.54097/hset.v12i.1418
