1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV. 2425--2433.
2. 3-D Relation Network for visual relation recognition in videos
3. Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR. 6299--6308.
4. Shuo Chen, Zenglin Shi, Pascal Mettes, and Cees G. M. Snoek. 2021. Social Fabric: Tubelet compositions for video relation detection. In ICCV.
5. Siqi Chen, Jun Xiao, and Long Chen. 2023b. Video scene graph generation from single-frame weak supervision. In ICLR.