1. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning;Aafaq,2019
2. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
3. Andreas, J., Rohrbach, M., Darrell, T., Klein, D., 2016. Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 39–48.
4. The “inverse hollywood problem”: From video to scripts and storyboards via causal analysis;Brand,1997
5. Quo vadis, action recognition? A new model and the kinetics dataset;Carreira,2017