1. Akula A, Changpinyo S, Gong B et al (2021) Crossvqa: scalably generating benchmarks for systematically testing vqa generalization. Proc Conf Empir Methods Nat Lang Process 2021:2148–2166
2. Anderson P, He X, Buehler C et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086
3. Antol S, Agrawal A, Lu J, et al (2015) Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
4. Berrios W, Mittal G, Thrush T et al (2023) Towards language models that can see: computer vision through the LENS of natural language. arXiv preprint arXiv:2306.16410
5. Brown T, Mann B, Ryder N et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901