Abstract
Understanding text and generating high-resolution, realistic images from text descriptions is a meaningful and challenging task. Most existing text-to-image generation methods follow a multi-stage generative adversarial network (GAN) framework, so the image quality at each stage depends heavily on the images generated in the previous stages. Moreover, existing text-to-image generators struggle to guarantee a consistent relationship between the text description and the generated image. To address these issues, we present a novel Text Feature Fusion GAN (TF-GAN) architecture, which emphasizes selecting useful local words and extracting semantically consistent global sentence features. First, the proposed sentence fusion attention mechanism (SFAttn) extracts useful keywords from the text description to refine image features in the early stages and to provide fine-grained details for the images of the later stage. Second, we propose a new Conditional Fusion Block (CFBlock), which combines multiple non-linear affine transformations to constrain images at a global semantic level. This allows the model to fuse sentence and image features more deeply, leading to more semantically consistent generated images. Extensive experiments and comparisons with other state-of-the-art methods on the Caltech-UCSD Birds 200 (CUB) and Microsoft Common Objects in Context (COCO) datasets show that our generated images are more photo-realistic and align more closely with the text descriptions.
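The abstract describes the CFBlock only at a high level. As a rough illustration of how sentence-conditioned affine fusion is typically realized, the sketch below predicts channel-wise scale and shift parameters from a sentence embedding and applies several such modulations in sequence. It is a minimal sketch under assumed design choices; all names (AffineFusion, ConditionalFusionBlock, sent_dim, depth) are hypothetical and are not taken from the paper.

```python
# Minimal sketch of sentence-conditioned affine fusion (hypothetical names;
# the paper's actual CFBlock design is not specified in the abstract).
import torch
import torch.nn as nn


class AffineFusion(nn.Module):
    """Predicts channel-wise scale (gamma) and shift (beta) from a sentence
    embedding and applies them to image feature maps."""

    def __init__(self, sent_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Sequential(
            nn.Linear(sent_dim, num_channels), nn.ReLU(),
            nn.Linear(num_channels, num_channels))
        self.to_beta = nn.Sequential(
            nn.Linear(sent_dim, num_channels), nn.ReLU(),
            nn.Linear(num_channels, num_channels))

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; sent: (B, sent_dim) sentence embedding
        gamma = self.to_gamma(sent).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(sent).unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta


class ConditionalFusionBlock(nn.Module):
    """Stacks several non-linear affine modulations so the sentence condition
    is injected repeatedly rather than once, then adds a residual connection."""

    def __init__(self, sent_dim: int, num_channels: int, depth: int = 2):
        super().__init__()
        self.affines = nn.ModuleList(
            AffineFusion(sent_dim, num_channels) for _ in range(depth))
        self.act = nn.LeakyReLU(0.2)
        self.conv = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        h = feat
        for affine in self.affines:
            h = self.act(affine(h, sent))
        return self.conv(h) + feat  # residual path preserves earlier content


if __name__ == "__main__":
    block = ConditionalFusionBlock(sent_dim=256, num_channels=64)
    img_feat = torch.randn(4, 64, 32, 32)
    sent_emb = torch.randn(4, 256)
    print(block(img_feat, sent_emb).shape)  # torch.Size([4, 64, 32, 32])
```

Repeating the affine modulation inside one block is one plausible reading of "combines multiple non-linear affine transformations": each pass re-injects the global sentence condition into the spatial features before the convolution and residual sum.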