1. ViViT: A Video Vision Transformer
2. Is space-time attention all you need for video understanding?;Bertasius;ICML,2021
3. Token merging: Your vit but faster;Bolya
4. Space-time mixing attention for video transformer;Bulat;Advances in Neural Information Processing Systems,2021
5. Generative Semantic Segmentation