1. Injecting Semantic Concepts into End-to-End Image Captioning
2. Vision-Language Pre-Training with Triple Contrastive Learning
3. An Empirical Study of Training End-to-End Vision-and-Language Transformers
4. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. International Conference on Machine Learning, 2015.
5. Self-Supervised Pre-Training of Visual Features in the Wild. Goyal et al. arXiv preprint, 2021.