1. Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed movies: Story based retrieval with contextual embeddings. In ACCV. 460–479.
2. Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2015. A deep Siamese network for scene detection in broadcast videos. In MM. 1199–1202.
3. Iz Beltagy et al. 2020. Longformer: The long-document transformer. CoRR (2020).
4. Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In ICML (Proceedings of Machine Learning Research), Vol. 139. 813–824.
5. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213–229.