1. VATT: Transformers for multi-modal self-supervised learning from raw video, audio and text;Akbari;Advances in Neural Information Processing Systems,2021
2. Localizing Moments in Video with Natural Language
3. VQA: Visual Question Answering
4. Is space-time attention all you need for video understanding?;Bertasius;ICML,2021
5. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning