1. Learning transferable visual models from natural language supervision;Radford
2. CoCa: Contrastive captioners are image-text foundation models;Yu,2022
3. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation;Li
4. Scaling up visual and vision-language representation learning with noisy text supervision;Jia
5. Align before fuse: Vision and language representation learning with momentum distillation;Li;Advances in neural information processing systems,2021