1. Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., Tran, D., 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. In: Proc. Adv. Neural Inf. Process. Syst., Vol. 33. pp. 9758–9770.
2. Arandjelovic, R., Zisserman, A., 2017. Look, listen and learn. In: Proc. IEEE Int. Conf. Comput. Vis. pp. 609–617.
3. Aytar, Y., Vondrick, C., Torralba, A., 2016. SoundNet: Learning sound representations from unlabeled video. In: Proc. Adv. Neural Inf. Process. Syst., Vol. 29. pp. 892–900.
4. Aytar, Y., et al., 2017. See, hear, and read: Deep aligned representations.
5. Ayush, K., et al., 2020. Geography-Aware Self-Supervised Learning.