1. VQA: Visual Question Answering
2. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, Vol. 26 (2013).
3. Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. In CVPR.
4. Kezhen Chen, Qiuyuan Huang, Yonatan Bisk, Daniel McDuff, and Jianfeng Gao. 2021. Kb-vlp: Knowledge based vision and language pretraining. In ICML, workshop.
5. Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen. 2022a. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. arXiv preprint arXiv:2210.02928 (2022).