1. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text;Akbari
2. Self-supervised multimodal versatile networks;Alayrac
3. Self-supervised learning by cross-modal audio-video clustering;Alwassel
4. Look, Listen and Learn;Arandjelović
5. Objects that Sound;Arandjelović