Exploring CLIP for Assessing the Look and Feel of Images

Citations: N/A
Mendeley readers: 76

Abstract

Measuring the perception of visual content is a long-standing problem in computer vision. Many mathematical models have been developed to evaluate the look or quality of an image. Although such tools are effective at quantifying degradations such as noise and blur, the resulting scores are only loosely coupled with human language. When it comes to more abstract perception of the feel of visual content, existing methods can only rely on supervised models explicitly trained on labels collected through laborious user studies. In this paper, we go beyond these conventional paradigms by exploring the rich visual language prior encapsulated in Contrastive Language-Image Pre-training (CLIP) models to assess both the quality perception (look) and abstract perception (feel) of images without explicit task-specific training. In particular, we discuss effective prompt designs and present an antonym prompt pairing strategy to harness this prior. We also provide extensive experiments on controlled datasets and Image Quality Assessment (IQA) benchmarks. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
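To make the antonym prompt pairing idea concrete, below is a minimal sketch using the open-source openai/CLIP package. The prompt wording ("Good photo." / "Bad photo.") and the logit scale of 100 follow common CLIP conventions and are assumptions for illustration; the full method described in the paper additionally adapts CLIP to handle images of arbitrary resolution, which this sketch omits.

```python
# Minimal sketch of antonym prompt pairing with CLIP for quality ("look") assessment.
# Assumes PyTorch and the open-source openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Antonym prompt pair: the score is the probability mass assigned to the positive prompt.
prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)

def clip_quality_score(image_path: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(prompts)
        # Cosine similarity between the image and each prompt.
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)        # shape [2]
        # Softmax over the antonym pair; 100 is CLIP's usual logit scale.
        probs = (100.0 * sims).softmax(dim=-1)
    return probs[0].item()   # closer to 1.0 -> "good", closer to 0.0 -> "bad"

# Example usage (hypothetical path):
# print(clip_quality_score("example.jpg"))
```

Swapping in other antonym pairs (for instance "Bright photo." / "Dark photo." or "Happy photo." / "Sad photo.") extends the same scheme from quality (look) to more abstract attributes (feel), which is the generalization the abstract refers to.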

Citation (APA)

Wang, J., Chan, K. C. K., & Loy, C. C. (2023). Exploring CLIP for Assessing the Look and Feel of Images. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 (Vol. 37, pp. 2555–2563). AAAI Press. https://doi.org/10.1609/aaai.v37i2.25353
