CLIP-Flow: Decoding images encoded in CLIP space

Hao Ma; Ming Li; Jingyuan Yang; Or Patashnik; Dani Lischinski; Daniel Cohen-Or; Hui Huang

Journal Article (Open Access)

Computational Visual Media (2024) 10(6) 1157-1168

DOI: 10.1007/s41095-023-0375-z

Abstract

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To make effective use of the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from that information. To bridge the embedding space of CLIP and the latent space of StyleGAN, we employed Real NVP, modified with activation normalization and invertible convolution. Because images and text share the same representation space in CLIP, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested text-to-image synthesis on the public Multi-Modal CelebA-HQ dataset. The experiments showed that our approach can generate high-quality, text-matching images and is comparable to state-of-the-art methods, both qualitatively and quantitatively.
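The bridging component described in the abstract can be illustrated with a toy invertible flow. The sketch below is not the authors' implementation: it only shows, in plain Python, how one flow step built from activation normalization, an invertible linear map (what a 1x1 convolution reduces to on a flat vector), and a Real NVP style affine coupling layer can be applied and then exactly inverted. All class names, the 8-dimensional vectors, the random weights, and the use of a rotation matrix (chosen so that the inverse is simply the transpose) are assumptions made for illustration; CLIP and StyleGAN themselves are omitted.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ActNorm:
    """Per-dimension affine map y = exp(log_s) * x + b (hypothetical toy version)."""

    def __init__(self, dim, rng):
        self.log_s = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    def forward(self, x):
        return [math.exp(s) * xi + bi for s, xi, bi in zip(self.log_s, x, self.b)]

    def inverse(self, y):
        return [(yi - bi) * math.exp(-s) for s, yi, bi in zip(self.log_s, y, self.b)]


class InvertibleLinear:
    """Invertible mixing of all dimensions; here a rotation, so W^-1 = W^T."""

    def __init__(self, dim, rng):
        w = [[float(i == j) for j in range(dim)] for i in range(dim)]
        for _ in range(2 * dim):  # compose random Givens rotations
            i, j = rng.sample(range(dim), 2)
            c, s = math.cos(theta := rng.uniform(0, 2 * math.pi)), math.sin(theta)
            ri, rj = w[i], w[j]
            w[i] = [c * a - s * b for a, b in zip(ri, rj)]
            w[j] = [s * a + c * b for a, b in zip(ri, rj)]
        self.w = w
        self.w_t = [list(col) for col in zip(*w)]

    def forward(self, x):
        return matvec(self.w, x)

    def inverse(self, y):
        return matvec(self.w_t, y)


class AffineCoupling:
    """Real NVP style coupling: the first half passes through unchanged and
    parameterizes a scale and shift applied to the second half."""

    def __init__(self, dim, rng):
        self.half = dim // 2
        rest = dim - self.half
        self.ws = [[rng.gauss(0, 0.5) for _ in range(self.half)] for _ in range(rest)]
        self.wt = [[rng.gauss(0, 0.5) for _ in range(self.half)] for _ in range(rest)]

    def _params(self, xa):
        # tanh keeps the log-scale bounded; a real model would use a small network.
        return [math.tanh(v) for v in matvec(self.ws, xa)], matvec(self.wt, xa)

    def forward(self, x):
        xa, xb = x[:self.half], x[self.half:]
        log_s, t = self._params(xa)
        return xa + [math.exp(s) * b + ti for s, b, ti in zip(log_s, xb, t)]

    def inverse(self, y):
        ya, yb = y[:self.half], y[self.half:]
        log_s, t = self._params(ya)
        return ya + [(b - ti) * math.exp(-s) for s, b, ti in zip(log_s, yb, t)]


class ToyFlow:
    """Stack of (ActNorm, InvertibleLinear, AffineCoupling) steps."""

    def __init__(self, dim, n_steps=3, seed=0):
        rng = random.Random(seed)
        self.layers = []
        for _ in range(n_steps):
            self.layers += [ActNorm(dim, rng), InvertibleLinear(dim, rng),
                            AffineCoupling(dim, rng)]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def inverse(self, y):
        for layer in reversed(self.layers):
            y = layer.inverse(y)
        return y
```

As a usage example, `flow = ToyFlow(8)` followed by `flow.inverse(flow.forward(x))` should recover any 8-element list `x` up to floating-point error. In the setting the abstract describes, one end of such a mapping would presumably correspond to a CLIP embedding and the other to a StyleGAN latent code, but the exact direction, dimensions, and training objective are not stated in the abstract and are not assumed here.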

Cite

Ma, H., Li, M., Yang, J., Patashnik, O., Lischinski, D., Cohen-Or, D., & Huang, H. (2024). CLIP-Flow: Decoding images encoded in CLIP space. Computational Visual Media, 10(6), 1157–1168. https://doi.org/10.1007/s41095-023-0375-z
