1. Agrawal, A., et al.: VQA: visual question answering. Int. J. Comput. Vision 123, 4–31 (2017)
2. Barra, S., Bisogni, C., De Marsico, M., Ricciardi, S.: Visual question answering: which investigated applications? Pattern Recognit. Lett. 151, 325–331 (2021)
3. Brown, T.B., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
4. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021). https://openreview.net/forum?id=YicbFdNTTy
5. Freire-Obregón, D., De Marsico, M., Barra, P., Lorenzo-Navarro, J., Castrillón-Santana, M.: Zero-shot ear cross-dataset transfer for person recognition on mobile devices. Pattern Recognit. Lett. 166, 143–150 (2023)