1. Lee et al., Stacked cross attention for image-text matching, in: ECCV, 2018.
2. Faghri et al., VSE++: Improving visual-semantic embeddings with hard negatives, in: BMVC, 2018.
3. Eisenschtat et al., Linking image and text with 2-way nets, in: CVPR, 2017.
4. K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: ICML, 2015, pp. 2048–2057.
5. O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge, PAMI (2017) 652–663.