1. Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. 2019. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR. 12487–12496.
2. Aishwarya Agrawal et al. 2017. VQA: Visual question answering. Int. J. Comput. Vis.
3. George Awad, Jonathan Fiscus, David Joy, Martial Michel, Alan F. Smeaton, Wessel Kraaij, Georges Quénot, Maria Eskevich, Robin Aly, Roeland Ordelman, Gareth J. F. Jones, Benoit Huet, and Martha Larson. 2016. Evaluating video search, video event detection, localization, and hyperlinking. In TRECVID.
4. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
5. Nicolas Ballas, Li Yao, Chris Pal, and Aaron C. Courville. 2016. Delving deeper into convolutional networks for learning video representations. In ICLR.