1) G. Awad, K. Curtis, A. A. Butt, J. Fiscus, A. Godil, Y. Lee, A. Delgado, J. Zhang, E. Godard, B. Chocot, L. Diduch, J. Liu, Y. Graham, G. Quénot, "An overview on the evaluated video retrieval tasks at TRECVID 2022," In Proc. of TRECVID 2022, (2022).
2) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, "DeViSE: A Deep Visual-Semantic Embedding Model," In Proc. of Advances in Neural Information Processing Systems (NIPS), 26, (2013).
3) R. Kiros, R. Salakhutdinov, R. S. Zemel, "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models," In Proc. of NIPS 2014 Deep Learning Workshop, (2014).
4) O. Vinyals, A. Toshev, S. Bengio, D. Erhan, "Show and Tell: A Neural Image Caption Generator," In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015).
5) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision," arXiv:2103.00020, (2021).