CLIP-Flow: Decoding images encoded in CLIP space

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text prompt. To make effective use of the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from that information. To bridge the embedding space of CLIP and the latent space of StyleGAN, we employed Real NVP, modified with activation normalization and invertible convolution. Because images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method, and we tested text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. The experiments showed that our approach generates high-quality images that match the text and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
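
To make the bridging step concrete, the following is a minimal sketch of one invertible flow step of the kind the abstract describes: activation normalization, an invertible linear map, and a Real NVP-style affine coupling. It is not the authors' implementation. The function names, the 8-dimensional toy vector standing in for a CLIP embedding, the random untrained parameters, and the unit lower-triangular matrix used in place of a learned invertible convolution are all assumptions made for illustration.

```python
import math
import random


def make_params(dim, seed=0):
    """Random, untrained parameters for one flow step (hypothetical)."""
    rng = random.Random(seed)
    half = dim // 2

    def small():
        return rng.uniform(-0.5, 0.5)

    return {
        # Activation normalization: per-dimension log-scale and bias.
        "log_scale": [small() for _ in range(dim)],
        "bias": [small() for _ in range(dim)],
        # Stand-in for an invertible convolution on a vector:
        # a unit lower-triangular matrix, which is always invertible.
        "tri": [[1.0 if i == j else (small() if j < i else 0.0)
                 for j in range(dim)] for i in range(dim)],
        # Single linear layers producing the coupling scale and shift.
        "w_s": [[small() for _ in range(half)] for _ in range(dim - half)],
        "w_t": [[small() for _ in range(half)] for _ in range(dim - half)],
    }


def _coupling_net(x_a, p):
    """Compute log-scale s and shift t from the untouched half."""
    s = [math.tanh(sum(w * a for w, a in zip(row, x_a))) for row in p["w_s"]]
    t = [sum(w * a for w, a in zip(row, x_a)) for row in p["w_t"]]
    return s, t


def flow_forward(x, p):
    """Map an embedding-like vector to a latent-like vector."""
    half = len(x) // 2
    # 1) Activation normalization.
    h = [math.exp(ls) * xi + b
         for xi, ls, b in zip(x, p["log_scale"], p["bias"])]
    # 2) Invertible linear map.
    h = [sum(w * v for w, v in zip(row, h)) for row in p["tri"]]
    # 3) Affine coupling: first half passes through, second half is
    #    scaled and shifted by functions of the first half.
    h_a, h_b = h[:half], h[half:]
    s, t = _coupling_net(h_a, p)
    y_b = [b * math.exp(si) + ti for b, si, ti in zip(h_b, s, t)]
    # The unit-triangular map contributes zero to the log-determinant.
    log_det = sum(p["log_scale"]) + sum(s)
    return h_a + y_b, log_det


def flow_inverse(y, p):
    """Exactly undo flow_forward, step by step in reverse order."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    s, t = _coupling_net(y_a, p)
    h = y_a + [(b - ti) * math.exp(-si) for b, si, ti in zip(y_b, s, t)]
    # Invert the unit lower-triangular map by forward substitution.
    z = []
    for i, row in enumerate(p["tri"]):
        z.append(h[i] - sum(row[j] * z[j] for j in range(i)))
    return [(zi - b) * math.exp(-ls)
            for zi, ls, b in zip(z, p["log_scale"], p["bias"])]


params = make_params(8)
clip_like = [0.1 * i for i in range(8)]  # toy stand-in for a CLIP embedding
latent, log_det = flow_forward(clip_like, params)
recovered = flow_inverse(latent, params)
```

Because every step is invertible, `flow_inverse` recovers the input up to floating-point error. In a trained model of this kind, several such steps would be stacked and the parameters learned so that the output follows the decoder's latent distribution.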

Ma, H., Li, M., Yang, J., Patashnik, O., Lischinski, D., Cohen-Or, D., & Huang, H. (2024). CLIP-Flow: Decoding images encoded in CLIP space. Computational Visual Media, 10(6), 1157–1168. https://doi.org/10.1007/s41095-023-0375-z
