1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
2. V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Workshop on EMC2, NeurIPS.
3. S. Herdade, A. Kappeler, K. Boakye, J. Soares, Image captioning: Transforming objects into words, in: NeurIPS, 2019, pp. 11137–11147.
4. M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara, Meshed-Memory Transformer for Image Captioning, in: IEEE/CVF CVPR, 2020, pp. 10578–10587.
5. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: IEEE CVPR, 2018, pp. 6077–6086.