1. VisualBERT: A simple and performant baseline for vision and language;Li,2019
2. Align before fuse: Vision and language representation learning with momentum distillation;Li;Advances in Neural Information Processing Systems,2021
3. MERLOT: Multimodal neural script knowledge models;Zellers;Advances in Neural Information Processing Systems,2021
4. Microsoft COCO: Common objects in context;Lin;Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings,2014
5. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks