1. Flamingo: a visual language model for few-shot learning; Alayrac, 2022
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D., 2015. VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2425–2433.
3. Foundational models in medical imaging: A comprehensive survey and future vision; Azad, 2023
4. Biten, A.F., Gomez, L., Rusinol, M., Karatzas, D., 2019. Good news, everyone! Context driven entity-aware captioning for news images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12466–12475.
5. Food-101 – mining discriminative components with random forests; Bossard, 2014