1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson;IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,2018
2. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models;Plummer;IEEE Int. Conf. Comput. Vis. (ICCV),2015
3. T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, 2018: pp. 1316–1324. https://doi.org/10.1109/CVPR.2018.00143.
4. Microsoft COCO: Common Objects in Context;Lin,2014
5. F. Faghri, D.J. Fleet, J.R. Kiros, S. Fidler, VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, (2018). https://doi.org/10.48550/arXiv.1707.05612.