1. Show, attend and tell: neural image caption generation with visual attention;Xu,2015
2. Image caption with global-local attention;Li,2017
3. VQA: visual question answering;Antol,2017
4. Making the V in VQA matter: elevating the role of image understanding in visual question answering;Goyal,2017
5. Adversarial cross-modal retrieval;Wang,2017