1. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs; Schuhmann, arXiv, 2021
2. Position-Guided Text Prompt for Vision-Language Pre-Training; Wang, 2023
3. CoCa: Contrastive Captioners are Image-Text Foundation Models; Yu, 2022
4. MaMMUT: A Simple Architecture for Joint Learning for Multimodal Tasks; Kuo, 2023
5. Flamingo: a Visual Language Model for Few-Shot Learning; Alayrac, NeurIPS, 2022