1. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
2. Visual question answering on image sets;Bansal,2020
3. D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, Y. Zhuang, Video question answering via gradually refined attention over appearance and motion, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1645–1653.
4. Video question answering: Datasets, algorithms and challenges;Zhong,2022
5. Tvqa: Localized, compositional video question answering;Lei,2018