1. Elad Amrani, Rami Ben-Ari, Daniel Rotman, and Alex Bronstein. 2020. Noise estimation using density estimation for self-supervised multimodal learning. arXiv preprint arXiv:2003.03186 8 (2020).
2. Localizing Moments in Video with Natural Language
3. Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. 2021. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021).
4. Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. arXiv preprint arXiv:2104.00650 (2021).
5. Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? arXiv preprint arXiv:2102.05095 (2021).