1. VideoBERT: A Joint Model for Video and Language Representation Learning
2. Multi-modal Dependency Tree for Video Captioning; Zhao; NeurIPS, 2021
3. Sketch, Ground, and Refine: Top-Down Dense Video Captioning
4. UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation; Luo, 2020
5. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments; Banerjee; ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005