1. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. ArXiv abs/1607.08822 (2016).
2. Martín Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning.
3. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. ArXiv abs/2308.12966 (2023).
4. Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. All are Worth Words: A ViT Backbone for Diffusion Models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22669--22679.
5. James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. [n. d.]. Improving Image Generation with Better Captions.