1. Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. In European conference on computer vision (ECCV).
2. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
3. Ballas, N., Yao, L., Pal, C., & Courville, A. (2016). Delving deeper into convolutional networks for learning video representations. In International conference on learning representations (ICLR).
4. Barbu, A., Bridge, A., Burchill, Z., Coroian, D., Dickinson, S., Fidler, S., et al. (2012). Video in sentences out. In Proceedings of the conference on uncertainty in artificial intelligence (UAI).
5. Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2013). Finding actors and actions in movies. In International conference on computer vision (ICCV).