EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Published: 2023-10-01 Volume: 34 Page: 5262-5274
Container-title: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Author:

Pramanick Shraman¹, Song Yale², Nag Sayan³, Lin Kevin Qinghong⁴, Shah Hardik², Shou Mike Zheng⁴, Chellappa Rama¹, Zhang Pengchuan²

Affiliation:

1. Johns Hopkins University

2. Meta AI

3. University of Toronto

4. National University of Singapore

Publisher

IEEE

Link

http://xplorestaging.ieee.org/ielx7/10376473/10376477/10378463.pdf?arnumber=10378463

References: 112 articles.

1. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text; Akbari; Advances in Neural Information Processing Systems; 2021

2. ViViT: A Video Vision Transformer

3. HierVL: Learning Hierarchical Video-Language Embeddings

4. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

5. VLMo: Unified Vision-Language Pre-training with Mixture-of-Modality-Experts; Bao; Advances in Neural Information Processing Systems; 2022

Cited by 3 articles.

1. An Outlook into the Future of Egocentric Vision; International Journal of Computer Vision; 2024-05-28

2. A Sound Approach: Using Large Language Models to Generate Audio Descriptions for Egocentric Text-Audio Retrieval; ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2024-04-14

3. STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos; 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023-10-01