1. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text; Akbari; Advances in Neural Information Processing Systems, 2021
2. Objects that Sound
3. 3D Semantic Parsing of Large-Scale Indoor Spaces
4. ViViT: A Video Vision Transformer
5. Layer Normalization; Ba, 2016