1. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: Meeting of the Association for Computational Linguistics: Long Papers, Jeju Island, Korea, pp. 359–368 (2012)
2. Socher, R., Karpathy, A., Le, Q.V., Manning, C.D., Ng, A.Y.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
3. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15, 2949–2980 (2014)
4. Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G.S., Dean, J.: Zero-shot learning by convex combination of semantic embeddings. In: International Conference on Learning Representations, ICLR 2014, Banff, Canada (2014)
5. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)