1. RADFORD A, KIM J W, HALLACY C, RAMESH A, GOH G, et al. Learning transferable visual models from natural language supervision. International Conference on Machine Learning. PMLR, 2021: 8748-8763.
2. BIANCO S, CELONA L, DONZELLA M, et al. Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion. 2023.
3. TSAI Y H, BAI S, LIANG P, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. 2019: 6558.
4. TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al. MLP-Mixer: An all-MLP Architecture for Vision. Advances in Neural Information Processing Systems, 2021.
5. CHEN X, HSIEH C J, GONG B. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations. 2021.