Author:
Yuan Haofei, Zhu Hongqing, Yang Suyi, Wang Ziying, Wang Nan
Abstract
Text-to-image (T2I) models based on single-stage generative adversarial networks (GANs) have achieved significant success in recent years. However, GAN-based generation models have two disadvantages. First, the generator does not incorporate any image-feature manifold structure, which makes it challenging to align image and text features. Second, the diversity of images and the abstraction of text prevent the model from learning the actual image distribution. This paper proposes a reversed image interaction generative adversarial network (RII-GAN), which consists of four components: a text encoder, a reversed image interaction network (RIIN), an adaptive affine-based generator, and a dual-channel feature alignment discriminator (DFAD). RIIN indirectly introduces the actual image distribution into the generation network, overcoming the network's lack of learning of the actual image-feature manifold structure and generating the distribution of text-matching images. Each adaptive affine block (AAB) in the proposed affine-based generator adaptively enhances text information, establishing an updated relation between the originally independent fusion blocks and the image features. Moreover, this study designs a DFAD to capture important feature information from images and text in two channels. Such a dual-channel backbone improves semantic consistency through a synchronized bi-modal information extraction structure. Experiments on publicly available datasets demonstrate the effectiveness of our model.
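The abstract does not spell out the AAB's internals. As a minimal sketch, assuming the block follows the common text-conditioned affine-modulation pattern of single-stage T2I GANs (per-channel scale and shift predicted from a sentence embedding), one plausible PyTorch rendering is given below; the class name AdaptiveAffineBlock and its parameters are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a text-conditioned adaptive affine block (AAB).
# Assumption: the block modulates image features with a per-channel scale
# (gamma) and shift (beta) predicted from the sentence embedding, in the
# spirit of affine fusion blocks used in single-stage T2I GANs.
import torch
import torch.nn as nn

class AdaptiveAffineBlock(nn.Module):
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        # Two small MLPs map the text embedding to per-channel parameters.
        self.gamma = nn.Sequential(
            nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels)
        )
        self.beta = nn.Sequential(
            nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels)
        )

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text: (B, text_dim) embedding.
        gamma = self.gamma(text).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(text).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        # Text-conditioned affine modulation of the image feature map.
        return gamma * feat + beta
```

Stacking such blocks lets every stage of the generator re-inject text information, rather than fusing it once at the input, which matches the abstract's claim that each AAB updates the relation between fusion blocks and image features.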
Funder
National Natural Science Foundation of China
Publisher
Springer Science and Business Media LLC