1. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text;Akbari;NeurIPS,2021
2. Wasserstein GAN;Arjovsky,2017
3. ViViT: A Video Vision Transformer
4. Layer normalization;Ba,2016
5. Is space-time attention all you need for video understanding?;Bertasius