Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Published: 2020-10-12
Container-title: Proceedings of the 28th ACM International Conference on Multimedia

Author:

Cheng Ying¹, Wang Ruize¹, Pan Zhihao¹, Feng Rui¹, Zhang Yuejie¹

Affiliation:

1. Fudan University, Shanghai, China

Funder

National Natural Science Foundation of China

Shanghai Municipal Science and Technology Major Project

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3394171.3413869

References: 45 articles.

1. Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).

2. Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2019. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. arXiv preprint arXiv:1911.12667 (2019).

3. Look, Listen and Learn

4. Objects that Sound

Cited by 50 articles.

1. A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys);Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

2. Audiovisual Moments in Time: A large-scale annotated dataset of audiovisual actions;PLOS ONE;2024-04-01

3. Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition;Information Technology and Control;2024-03-22

4. Audio-visual saliency prediction with multisensory perception and integration;Image and Vision Computing;2024-03

5. LAVSS: Location-Guided Audio-Visual Spatial Audio Separation;2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2024-01-03