Abstract
Text-to-image generation aims to automatically produce a photo-realistic image conditioned on a textual description. To facilitate real-world applications of text-to-image synthesis, we focus on three issues: (1) How to ensure that generated samples are believable, realistic or natural? (2) How to exploit the latent space of the generator to edit a synthesized image? (3) How to improve the explainability of a text-to-image generation framework? We introduce two new data sets for benchmarking, i.e., the Good & Bad bird and face data sets, consisting of successful as well as unsuccessful generated samples. These data sets can be used to effectively and efficiently acquire high-quality images by increasing the probability of generating Good latent codes with a separate, new classifier. Additionally, we present a novel algorithm that identifies semantically understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis on the pre-trained weight values of the generator. Furthermore, we develop a background-flattening loss (BFL) to improve the background appearance in the generated images. Subsequently, we introduce linear-interpolation analysis between pairs of text keywords, which we extend into a similar triangular ‘linguistic’ interpolation. The resulting visual array of interpolation results gives users a deep look into what the text-to-image synthesis model has learned within the linguistic embeddings. Experimental results on the recent DiverGAN generator, pre-trained on three common benchmark data sets, demonstrate that our classifier achieves better than 98% accuracy in predicting Good/Bad classes for synthetic samples and that our proposed approach is able to derive various interpretable semantic properties for the text-to-image GAN model.
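For illustration, a minimal sketch of the weight-ICA idea summarized above: independent component analysis is applied to the weight matrix of the generator layer that consumes the latent code, and the resulting components are treated as candidate semantic directions for latent-space editing. The file path, layer shape and latent dimensionality below are assumptions for the sketch, not taken from the DiverGAN release.

```python
# Sketch only: assumes the weight matrix of the generator's first fully
# connected layer (which maps a 100-d latent code into the network) has
# been exported to a NumPy file. Path and shapes are hypothetical.
import numpy as np
from sklearn.decomposition import FastICA

W = np.load("generator_fc_weight.npy")      # shape (out_dim, 100), hypothetical export

ica = FastICA(n_components=10, random_state=0, max_iter=1000)
ica.fit(W)                                  # rows of W are treated as samples in z-space
directions = ica.components_                # shape (10, 100): candidate semantic directions

def edit_latent(z, k, alpha=3.0):
    """Move a latent code z along the k-th ICA direction with strength alpha."""
    d = directions[k] / np.linalg.norm(directions[k])
    return z + alpha * d

z = np.random.randn(100)
z_edited = edit_latent(z, k=0)              # feed z_edited to the generator and inspect the change
```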
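Likewise, a hedged sketch of the linear and triangular interpolation between keyword embeddings: intermediate embeddings are formed as convex combinations of two or three keyword embeddings and passed to the generator with a fixed latent code, so the resulting image array visualizes what the model associates with each keyword. The `embed` and `generate` names are placeholders for whatever text encoder and pre-trained generator are actually used.

```python
# Sketch only: `embed` and `generate` stand in for the text encoder and
# the pre-trained generator; they are placeholders, not a real API.
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between two keyword embeddings."""
    return (1.0 - t) * a + t * b

def triangular(a, b, c, u, v):
    """Barycentric ('triangular') interpolation of three keyword embeddings.
    Requires u, v >= 0 and u + v <= 1; the third weight is 1 - u - v."""
    return (1.0 - u - v) * a + u * b + v * c

# Example usage with placeholder encoder/generator:
# e_red, e_yellow = embed("red"), embed("yellow")
# z = np.random.randn(100)                                   # fixed latent code
# row = [generate(z, lerp(e_red, e_yellow, t)) for t in np.linspace(0, 1, 8)]
```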
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Software
Cited by
3 articles.