Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval

Published: 2023-10-26
Container-title: Proceedings of the 31st ACM International Conference on Multimedia

Author:

Shi Yaya¹, Liu Haowei², Xu Haiyang³, Ma Zongyang², Ye Qinghao³, Hu Anwen³, Yan Ming³, Zhang Ji³, Huang Fei³, Yuan Chunfeng², Li Bing², Hu Weiming², Zha Zheng-Jun¹

Affiliation:

1. University of Science and Technology of China, Hefei, China

2. Institute of Automation, CAS & University of Chinese Academy of Sciences, Beijing, China

3. DAMO Academy, Alibaba Group, Hangzhou, China

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3581783.3612537

References (52 articles):

1. Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178 (2021).

2. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. In NeurIPS.

3. Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In ICCV. 5803-5812.

4. Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, and Lele Cheng. 2022. LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval. arXiv preprint arXiv:2207.04858 [cs.CV] (2022).

5. Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. 2020. SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. arXiv preprint arXiv:2010.00768 (2020).