1. VinVL: Making visual representations matter in vision-language models;Zhang,2021
2. Scaling up vision-language pre-training for image captioning;Hu,2021
3. An image is worth 16x16 words: Transformers for image recognition at scale;Dosovitskiy,2021
4. Meshed-memory transformer for image captioning;Cornia,2020
5. BLEU: a method for automatic evaluation of machine translation;Papineni,2002