1. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
2. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In CVPR. 2425--2433.
3. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
4. Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? In ICML. PMLR, 813--824.
5. Revisiting the “Video” in Video-Language Understanding