1. Anderson, P., Fernando, B., Johnson, M., Gould, S., 2016. SPICE: semantic propositional image caption evaluation, in: ECCV, pp. 382–398.
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention for image captioning and visual question answering, in: CVPR, pp. 6077–6086.
3. Chen, C., Mu, S., Xiao, W., Ye, Z., Wu, L., Ju, Q., 2019. Improving image captioning with conditional generative adversarial nets, in: AAAI, pp. 8142–8150.
4. Chung, J., Cho, K., Bengio, Y., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
5. Devlin, J., Chang, M., Lee, K., Toutanova, K., 2019. BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, pp. 4171–4186.