TAN: a temporal-aware attention network with context-rich representation for boosting proposal generation

Published: 2024-02-22 Issue: 3 Volume: 10 Page: 3691-3708
ISSN: 2199-4536
Container-title: Complex & Intelligent Systems
Language: en
Short-container-title: Complex Intell. Syst.

Author:

Jiao Yanyan, Yang Wenzhu, Xing Wenjie, Zeng Shuang, Geng Lei

Abstract

Temporal action proposal generation in untrimmed videos is very challenging, and comprehensive context exploration is critical for generating accurate candidate action instances. This paper proposes a Temporal-aware Attention Network (TAN) that localizes context-rich proposals by enhancing the temporal representations of boundaries and proposals. First, we observe that obtaining precise location information for action instances requires considering long-distance temporal context. To this end, we propose a Global-Aware Attention (GAA) module for boundary-level interaction. Specifically, we introduce two novel gating mechanisms into the top-down interaction structure to effectively incorporate multi-level semantics into video features. Second, we design an efficient task-specific Adaptive Temporal Interaction (ATI) module to learn proposal associations. TAN enhances proposal-level contextual representations over a wide range by using multi-scale interaction modules. Extensive experiments on ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of the proposed method: TAN achieves 73.43% AR@1000 on THUMOS-14 and 69.01% AUC on ActivityNet-1.3. Moreover, TAN significantly improves temporal action detection performance when combined with existing action classification frameworks.

Funder

Important Research Project of Hebei Province

Scientific Research Foundation of Hebei University for Distinguished Young Scholars

Scientific Research Foundation of Colleges and Universities in Hebei Province

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s40747-024-01343-0.pdf

Reference: 60 articles.

1. Arnab, A., et al., ViViT: A Video Vision Transformer, in IEEE/CVF International Conference on Computer Vision. 2021. p. 6836–6846.

2. Bai, Y., et al., Boundary content graph neural network for temporal action proposal generation, in European Conference on Computer Vision. 2020, Springer. p. 121–137.

3. Bertasius, G., H. Wang, and L. Torresani, Is Space-Time Attention All You Need for Video Understanding?, in International Conference on Machine Learning. 2021, PMLR. p. 813–824.

4. Bochkovskiy, A., C.-Y. Wang, and H.-Y.M. Liao, YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

5. Caba Heilbron, F., et al., ActivityNet: A large-scale video benchmark for human activity understanding, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2015. p. 961–970.