1. Chen, F., Chen, X., Xu, S., Xu, B.: Improving cross-modal understanding in visual dialog via contrastive learning. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7937–7941 (2022)
2. Chen, Y.-C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
3. Dou, Z.-Y., et al.: An empirical study of training end-to-end vision-and-language transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18145–18155 (2022)
4. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: British Machine Vision Conference (BMVC) (2018)
5. Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195 (2020)