1. Kiros R, Salakhutdinov R, Zemel R (2014) Multi-modal neural language models, In: Proceedings of the 31st international conference on machine learning, pp 595–603
2. Xu K, Ba JL, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention, In: Proceedings of the 32nd international conference on machine learning, pp 2048–2057
3. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering, In: 2018 IEEE conference on computer vision and pattern recognition, pp 6077–6086
4. Mathews AP, Xie L, He X (2016) SentiCap: Generating image descriptions with sentiments, In: Proceedings of the 30th AAAI conference on artificial intelligence, pp 3574–3580
5. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding, In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics, pp 4171–4186