1. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. International Conference on Learning Representations, Vol. abs/1409.0473.
3. Meshed-Memory Transformer for Image Captioning
4. Long-term recurrent convolutional networks for visual recognition and description
5. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. International Conference on Learning Representations, Vol. abs/2010.11929.