1. Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., Tran, D.: Self-supervised learning by cross-modal audio-video clustering. In: NeurIPS (2020)
2. Lecture Notes in Computer Science;Y Bai,2020
3. Bao, H., Dong, L., Wei, F.: Beit: bert pre-training of image transformers. arXiv preprint. arXiv:2106.08254 (2021)
4. Benaim, S., et al.: SpeedNet: learning the speediness in videos. In: CVPR (2020)
5. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: a holistic approach to semi-supervised learning. In: NeurIPS (2019)