1. Microsoft COCO: Common objects in context;lin;Proc ECCV,2014
2. Faster R-CNN: Towards real-time object detection with region proposal networks;ren;Proc NIPS,2015
3. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks;lu;Proc NeurIPS,2019
4. BERT: Pre-training of deep bidirectional transformers for language understanding;devlin;Proc NAACL-HLT,2019
5. COOT: Cooperative hierarchical transformer for video-text representation learning;ging;Proc NeurIPS,2020