1. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
2. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2016. [Online]. Available: https://arxiv.org/abs/1409.0473
3. I. Chowdhury, K. Nguyen, C. Fookes, and S. Sridharan, “A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA),” in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 1842–1846.
4. Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” CoRR, vol. abs/1612.08083, 2016. [Online]. Available: http://arxiv.org/abs/1612.08083
5. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423