1. Afouras, T., Owens, A., Chung, J.S., Zisserman, A., 2020. Self-supervised learning of audio-visual objects from video. In: The European Conference on Computer Vision. ECCV.
2. Ahsan et al., 2019. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition.
3. Alwassel et al., 2020. Self-supervised learning by cross-modal audio-video clustering.
4. Bachman, P., Hjelm, R.D., Buchwalter, W., 2019. Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. pp. 15535–15545.
5. Bai et al., 2020. Can temporal information help with contrastive self-supervised learning?