Optimizing and interpreting the latent space of the conditional text-to-image GANs

Abstract

Text-to-image generation aims to automatically produce a photo-realistic image conditioned on a textual description. To facilitate real-world applications of text-to-image synthesis, we focus on the following three issues: (1) How to ensure that generated samples are believable, realistic, or natural? (2) How to exploit the latent space of the generator to edit a synthesized image? (3) How to improve the explainability of a text-to-image generation framework? We introduce two new data sets for benchmarking, i.e., the Good & Bad bird and face data sets, consisting of successful as well as unsuccessful generated samples. These data sets can be used to acquire high-quality images effectively and efficiently, by using a separate, new classifier to increase the probability of generating Good latent codes. Additionally, we present a novel algorithm that identifies semantically understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis (ICA) on the pre-trained weight values of the generator. Furthermore, we develop a background-flattening loss (BFL) to improve the background appearance in the generated images. Subsequently, we introduce linear-interpolation analysis between pairs of text keywords, which we extend into a similar triangular ‘linguistic’ interpolation. The visual array of interpolation results gives users a deep look into what the text-to-image synthesis model has learned within its linguistic embeddings. Experimental results on the recent DiverGAN generator, pre-trained on three common benchmark data sets, demonstrate that our classifier achieves better than 98% accuracy in predicting Good/Bad classes for synthetic samples, and that our proposed approach is able to derive various interpretable semantic properties for the text-to-image GAN model.
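To make the two latent-space ideas concrete, the following is a minimal Python sketch, not the authors' implementation: it applies off-the-shelf ICA to a pre-trained generator weight matrix to obtain candidate semantic directions, and linearly blends two keyword embeddings for linguistic interpolation. The layer name generator.fc, the text_encoder helper, and all hyperparameters are hypothetical placeholders introduced for illustration.

    import torch
    from sklearn.decomposition import FastICA

    # Hypothetical handles: `generator` is a pre-trained conditional
    # text-to-image generator (e.g., DiverGAN) and `generator.fc` stands in
    # for its first fully connected layer, whose weights map the latent
    # code z to features.
    W = generator.fc.weight.detach().cpu().numpy()   # (out_features, latent_dim)

    # ICA on the pre-trained weight values: each independent component is a
    # candidate semantically meaningful direction in the latent space.
    ica = FastICA(n_components=10, random_state=0)
    ica.fit(W)
    directions = torch.from_numpy(ica.components_).float()  # (10, latent_dim)

    # Latent-space editing: shift a latent code along a discovered direction.
    z = torch.randn(1, directions.shape[1])
    z_edited = z + 3.0 * directions[0]               # 3.0 = edit strength

    # Linguistic interpolation: blend two keyword embeddings (text_encoder is
    # a hypothetical stand-in for the model's text encoder) and render each
    # intermediate image.
    e_a, e_b = text_encoder("red"), text_encoder("yellow")
    frames = [generator(z, (1 - t) * e_a + t * e_b)
              for t in torch.linspace(0, 1, 8)]

In the same spirit, the Good/Bad filtering described above would amount to a binary classifier, trained on the new data sets, that rejects latent codes predicted to yield unsuccessful images.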

Citation (APA)
Zhang, Z., & Schomaker, L. (2024). Optimizing and interpreting the latent space of the conditional text-to-image GANs. Neural Computing and Applications, 36(5), 2549–2572. https://doi.org/10.1007/s00521-023-09185-6
