1. Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. (2010). Every picture tells a story: Generating sentences from images. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10 (pp. 15–29). Berlin, Heidelberg: Springer.
2. Kuznetsova, P., Ordonez, V., Berg, A. C., Berg, T. L., & Choi, Y. (2012). Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, ACL’12 (Vol. 1, pp. 359–368). Stroudsburg, PA, USA: Association for Computational Linguistics.
3. Li, S., Kulkarni, G., Berg, T. L., Berg, A. C., & Choi, Y. (2011). Composing simple image descriptions using web-scale n-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL’11 (pp. 220–228). Stroudsburg, PA, USA: Association for Computational Linguistics.
4. Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for image caption generation. CoRR, abs/1411.5654.
5. Mao, J., Xu, W., Yang, Y., Wang, J., & Yuille, A. L. (2014). Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR, abs/1412.6632.