1. UCF101: A dataset of 101 human actions classes from videos in the wild;soomro,2012
2. Two-stream convolutional networks for action recognition in videos;simonyan;ICLRE,2014
3. Visual Grounding in Video for Unsupervised Word Translation
4. Learning to Localize Sound Source in Visual Scenes
5. VideoBERT: A Joint Model for Video and Language Representation Learning