1. Fusion of Detected Objects in Text for Visual Question Answering
2. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
3. Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. 2022. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. In Advances in Neural Information Processing Systems. 32897--32912.
4. Kang Chen and Xiangqian Wu. 2023. VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning. arXiv preprint arXiv:2303.02635 (2023).
5. Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In Advances in Neural Information Processing Systems. 6616--6628.