HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Published: 2021-10
Container-title: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Author:

Liu Song¹, Fan Haoqi², Qian Shengsheng³, Chen Yiru⁴, Ding Wenkui⁴, Wang Zhongyuan⁴

Affiliation:

1. Peking University

2. FAIR

3. Institute of Automation, CAS, National Lab of Pattern Recognition

4. Kuaishou Technology

Publisher

IEEE

Link

http://xplorestaging.ieee.org/ielx7/9709627/9709628/09710620.pdf?arnumber=9710620

References: 70 articles.

1. ActBERT: Learning Global-Local Video-Text Representations

2. Support-set bottlenecks for video-text representation learning; Patrick; ICLR, 2021

3. Representation learning with contrastive predictive coding; van den Oord; 2018

4. UniViLM: A unified video and language pre-training model for multimodal understanding and generation; Luo; CoRR, 2020

5. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks; Lu; Advances in Neural Information Processing Systems, 2019

Cited by 92 articles.

1. Multi-task Information Enhancement Recommendation model for educational Self-Directed Learning System; Expert Systems with Applications; 2024-10

2. CLIP2TF: Multimodal video–text retrieval for adolescent education; Displays; 2024-09

3. An empirical study of excitation and aggregation design adaptions in CLIP4Clip for video–text retrieval; Neurocomputing; 2024-09

4. LSECA: local semantic enhancement and cross aggregation for video-text retrieval; International Journal of Multimedia Information Retrieval; 2024-07-22

5. Multilevel Semantic Interaction Alignment for Video–Text Cross-Modal Retrieval; IEEE Transactions on Circuits and Systems for Video Technology; 2024-07