Author:
Fragomeni Adriano, Wray Michael, Damen Dima
Publisher:
Springer Nature Switzerland