1. ActBERT: Learning Global-Local Video-Text Representations
2. Support-set bottlenecks for video-text representation learning;patrick;ICLR,2021
3. Representation learning with contrastive predictive coding;van den oord;CoRR,2018
4. UniViLM: A unified video and language pre-training model for multimodal understanding and generation;luo;CoRR,2020
5. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks;lu;Advances in Neural Information Processing Systems,2019