1. VQA: visual question answering;Antol,2015
2. VizWiz grand challenge: answering visual questions from blind people;Gurari,2018
3. Deep residual learning for image recognition;He,2016
4. BERT: pre-training of deep bidirectional transformers for language understanding;Devlin,2019
5. Show, attend and tell: neural image caption generation with visual attention;Xu,2015