1. Self-supervised object detection from audio-visual correspondence;Afouras;CVPR,2022
2. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text;Akbari;NeurIPS,2021
3. Self-Supervised Learning by Cross-Modal Audio-Video Clustering;Alwassel;NeurIPS,2020
4. Look, Listen and Learn;Arandjelović;ICCV,2017
5. Labelling unlabelled videos from scratch with multi-modal self-supervision;Asano;NeurIPS,2020