1. YouTube-8M: A large-scale video classification benchmark;Abu-El-Haija,2016
2. ViViT: A video vision transformer;Arnab,2021
3. Collecting highly parallel data for paraphrase evaluation;Chen,2011
4. Chen, S., Jiang, Y., 2019. Motion Guided Spatial Attention for Video Captioning. In: AAAI. pp. 8191–8198.
5. Motion guided region message passing for video captioning;Chen,2021