1. An image is worth 16×16 words: Transformers for image recognition at scale;dosovitskiy;International Conference on Learning Representations,2021
2. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description
3. CapWAP: Image captioning with a purpose;fisch;Proceedings of EMNLP,2020
4. Communication breakdown: On the low mutual intelligibility between human and neural captioning;dessì;Proceedings of EMNLP,2022