1. Yan, R., Hauptmann, A.G.: A review of text and image retrieval approaches for broadcast news video. Inf. Retr. 10, 445–484 (2007)
2. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., Keller, F., Muscat, A., Plank, B.: Automatic description generation from images: a survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 55, 409–442 (2016)
3. Aloimonos, Y., Aloimonos, Y., Aloimonos, Y.: Computer vision and natural language processing: recent approaches in multimedia and robotics. ACM Comput. Surv. 49, 71 (2016)
4. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: Meeting of the Association for Computational Linguistics: Long Papers, Korea, Jeju Island, pp. 359–368 (2012)
5. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: International Conference on Neural Information Processing Systems, Neural Information Processing Systems Foundation, Lake Tahoe, pp. 2121–2129 (2013)