Self-Supervised Learning for Videos: A Survey

Published: 2022-12-21
ISSN: 0360-0300
Container-title: ACM Computing Surveys
Language: en
Short-container-title: ACM Comput. Surv.

Author:

Schiappa Madeline C., Rawat Yogesh S., Shah Mubarak¹

Affiliation:

1. Center for Research in Computer Vision, University of Central Florida, USA

Abstract

The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way to learn representations without annotations and has shown promise in both the image and video domains. Unlike in the image domain, learning video representations is more challenging because of the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain. We summarize these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets and downstream evaluation tasks, and offer insights into the limitations of existing works and potential future directions in this area.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3577925

References: 217 articles.

1. Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675 [cs.CV]

2. Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. 2020. Self-supervised Learning of Audio-Visual Objects from Video. Vol. 12363 LNCS. Springer International Publishing. 208–224 pages.

3. Unaiza Ahsan, Rishi Madhok, and Irfan Essa. 2019. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, WACV, 179–189.

4. Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Advances in Neural Information Processing Systems (NeurIPS).

5. Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 25–37.

Cited by 47 articles.

1. Exploring simple triplet representation learning; Computational and Structural Biotechnology Journal; 2024-12

2. Rethinking samples selection for contrastive learning: Mining of potential samples; Knowledge-Based Systems; 2024-09

3. Self-Supervised Denoising through Independent Cascade Graph Augmentation for Robust Social Recommendation; Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; 2024-08-24

4. Integrating pseudo labeling with contrastive clustering for transformer-based semi-supervised action recognition; Applied Intelligence; 2024-08-10

5. Device Selection Methods in Federated Learning: A Survey; SN Computer Science; 2024-08-02