1. A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, 2021.
2. Gao et al., CLIP-Adapter: Better vision-language models with feature adapters, 2021.
3. Zhou et al., Learning to prompt for vision-language models, Int. J. Comput. Vis., 2022.
4. Lu et al., ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
5. J. Duan, L. Chen, S. Tran, J. Yang, Y. Xu, B. Zeng, T. Chilimbi, Multi-modal alignment using representation codebook, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15651–15660.