1. Translating Video Content to Natural Language Descriptions
2. Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning
3. End-to-End Learning of Visual Representations From Uncurated Instructional Videos
4. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. Luo et al. arXiv e-prints, 2020.
5. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Lu et al. NeurIPS, 2019.