Multiple Temporal Pooling Mechanisms for Weakly Supervised Temporal Action Localization

Published:2023-02-25 Issue:3 Volume:19 Page:1-19
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
Language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Dou Peng¹, Zeng Ying¹, Wang Zhuoqun¹, Hu Haifeng¹

Affiliation:

1. School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, Guangdong, People’s Republic of China

Abstract

Recent action localization methods learn in a weakly supervised manner to avoid the high cost of human labeling. Most of them are based on the Multiple Instance Learning framework, in which temporal pooling is an indispensable component that usually relies on the guidance of snippet-level Class Activation Sequences (CAS). However, we observe that previous works use only a simple convolutional neural network to generate the CAS, which ignores both weakly discriminative foreground action segments and background segments; moreover, the relationship between different actions has not been considered. To solve this problem, we propose multiple temporal pooling mechanisms (MTP) to utilize the available information more fully. Specifically, with the design of the Foreground Variance Branch, Dual Foreground Attention Branch, and Hybrid Attention Fine-tuning Branch, MTP leverages more effective information from different aspects and generates different CASs to guide the learning of temporal pooling. Moreover, different loss functions are designed to better optimize the individual branches, aiming to effectively distinguish actions from the background. Our method shows excellent results on the THUMOS14 and ActivityNet1.2 datasets.
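For readers unfamiliar with the temporal pooling step mentioned in the abstract, the following is a minimal sketch of one common generic form, top-k mean pooling over a CAS, which turns snippet-level scores into video-level class scores that can be trained against video-level labels. It is not the MTP method and does not reproduce any of the paper's branches or losses; the function names, the value of k, and the toy scores are all assumptions made for illustration.

```python
import math


def topk_temporal_pooling(cas, k):
    """Average the k highest snippet scores of each class.

    cas: list of T snippets, each a list of C class scores
         (a toy Class Activation Sequence; illustrative only).
    Returns a list of C video-level scores.
    """
    if not cas:
        raise ValueError("cas must contain at least one snippet")
    num_classes = len(cas[0])
    k = max(1, min(k, len(cas)))  # clamp k to the range [1, T]
    video_scores = []
    for c in range(num_classes):
        # Sort this class's scores over time, highest first.
        column = sorted((snippet[c] for snippet in cas), reverse=True)
        video_scores.append(sum(column[:k]) / k)
    return video_scores


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy CAS with T = 4 snippets and C = 2 classes (made-up numbers).
cas = [[0.1, 2.0], [3.0, 0.2], [2.0, 0.1], [0.0, 0.0]]
video_scores = topk_temporal_pooling(cas, k=2)
video_probs = softmax(video_scores)
```

In this sketch only the k strongest snippets of each class contribute to the video-level score, so snippets with low activations (likely background) are ignored by the pooling step.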

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3567828

Reference: 59 articles.

1. Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. 2017. End-to-end, single-stream temporal action detection in untrimmed videos. In Proceedings of the British Machine Vision Conference 2017. British Machine Vision Association.

2. SST: Single-Stream Temporal Action Proposals

3. ActivityNet: A large-scale video benchmark for human activity understanding

4. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

5. Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Cited by 2 articles.

1. Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-03-08

2. Snippet-to-Prototype Contrastive Consensus Network for Weakly Supervised Temporal Action Localization;IEEE Transactions on Multimedia;2024