Author:
Jiao Yanyan, Yang Wenzhu, Xing Wenjie, Zeng Shuang, Geng Lei
Abstract
Temporal action proposal generation in untrimmed videos is very challenging, and comprehensive context exploration is critically important for generating accurate candidates of action instances. This paper proposes a Temporal-aware Attention Network (TAN) that localizes context-rich proposals by enhancing the temporal representations of boundaries and proposals. First, we pinpoint that obtaining precise location information of action instances requires considering long-distance temporal context. To this end, we propose a Global-Aware Attention (GAA) module for boundary-level interaction. Specifically, we introduce two novel gating mechanisms into the top-down interaction structure to effectively incorporate multi-level semantics into video features. Second, we design an efficient task-specific Adaptive Temporal Interaction (ATI) module to learn proposal associations. TAN enhances proposal-level contextual representations over a wide temporal range by utilizing multi-scale interaction modules. Extensive experiments on ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of the proposed method, e.g., TAN achieves 73.43% AR@1000 on THUMOS-14 and 69.01% AUC on ActivityNet-1.3. Moreover, TAN significantly improves temporal action detection performance when equipped with existing action classification frameworks.
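The abstract does not give the exact formulation of the GAA gating, so the sketch below is purely illustrative: it shows one plausible way a learned gate can inject coarse, semantically richer features into finer temporal features in a top-down structure. Every name in it (GatedTopDownFusion, the 1x1-convolution gate, the residual fusion) is a hypothetical assumption for illustration, not the paper's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTopDownFusion(nn.Module):
    """Hypothetical sketch of one gated top-down fusion step.

    This is NOT the paper's GAA module; it only illustrates the general
    idea of gating high-level semantics into lower-level temporal features.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-position gate in [0, 1] from the concatenated
        # low- and high-level features (assumption).
        self.gate = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (batch, channels, time). Upsample the coarse high-level
        # features to the temporal resolution of the low-level ones.
        high = F.interpolate(high, size=low.shape[-1],
                             mode="linear", align_corners=False)
        g = self.gate(torch.cat([low, high], dim=1))
        # Gated residual fusion: inject high-level semantics where useful.
        return self.proj(low + g * high)

if __name__ == "__main__":
    fuse = GatedTopDownFusion(channels=256)
    low = torch.randn(2, 256, 100)   # fine temporal scale
    high = torch.randn(2, 256, 25)   # coarse, semantically richer scale
    print(fuse(low, high).shape)     # torch.Size([2, 256, 100])
```

Under these assumptions, stacking such fusion steps from the coarsest to the finest temporal scale would realize a top-down interaction pathway of the kind the abstract describes.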
Funder
Important Research Project of Hebei Province
Scientific Research Foundation of Hebei University for Distinguished Young Scholars
Scientific Research Foundation of Colleges and Universities in Hebei Province
Publisher
Springer Science and Business Media LLC