1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
2. Dynamic fusion with intra- and inter-modality attention flow for visual question answering;Gao,2019
3. Long-term recurrent convolutional networks for visual recognition and description;Donahue,2015
4. Show, attend and tell: Neural image caption generation with visual attention;Xu,2015
5. J. Ji, Y. Luo, X. Sun, F. Chen, G. Luo, Y. Wu, Y. Gao, R. Ji, Improving image captioning by leveraging intra- and inter-layer global representation in transformer network, arXiv preprint arXiv:2012.07061 (2020).