StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Published:2019-07-17 Issue: Volume:33 Page:8401-8408
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

He Dongliang, Zhou Zhichao, Gan Chuang, Li Fu, Liu Xiao, Li Yandong, Wang Limin, Wen Shilei

Abstract

Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution to the super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution to the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
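The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `make_super_images` and `temporal_separable_conv` are hypothetical, the frame layout `(frames, 3, H, W)` is an assumption, and the second function only approximates a separate temporal-wise and channel-wise convolution (a per-channel filter along time followed by a channel-mixing matrix), leaving out the normalization, shortcut connection, and other details of the temporal Xception block described in the paper.

```python
import numpy as np


def make_super_images(video, n):
    """Stack every n consecutive RGB frames into one super-image with 3n channels.

    video: array of shape (T * n, 3, H, W)  (assumed layout)
    returns: array of shape (T, 3 * n, H, W)
    """
    frames, channels, height, width = video.shape
    if frames % n != 0:
        raise ValueError("number of frames must be a multiple of n")
    # In row-major order, frame t*n + i, channel c lands at super-image t,
    # channel i*channels + c, so a plain reshape performs the stacking.
    return video.reshape(frames // n, n * channels, height, width)


def temporal_separable_conv(features, temporal_kernels, channel_weights):
    """Illustrative separable convolution over a feature sequence.

    features:         (T, C) one feature vector per super-image
    temporal_kernels: (C, K) one 1D kernel per channel, K odd (temporal-wise step)
    channel_weights:  (C, C_out) mixing matrix (channel-wise step)
    returns:          (T, C_out)
    """
    steps, _ = features.shape
    kernel_size = temporal_kernels.shape[1]
    pad = kernel_size // 2
    # Zero-pad along time so the output keeps the same temporal length.
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros_like(features, dtype=float)
    for offset in range(kernel_size):
        # Each channel is filtered along time with its own kernel tap.
        out += padded[offset:offset + steps] * temporal_kernels[:, offset]
    # Mix channels independently at every time step.
    return out @ channel_weights


# Usage: 8 RGB frames of size 2x2 grouped into 2 super-images of 12 channels each.
video = np.arange(8 * 3 * 2 * 2).reshape(8, 3, 2, 2)
print(make_super_images(video, 4).shape)
```

Splitting the temporal and channel steps keeps the parameter count at C·K + C·C_out rather than C·C_out·K for a full temporal convolution, which is the usual motivation for separable designs of this kind.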

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 55 articles.

1. Weakly-Supervised Action Learning in Procedural Task Videos via Process Knowledge Decomposition;IEEE Transactions on Circuits and Systems for Video Technology;2024-07

2. Multimedia Human-Computer Interaction Method in Video Animation Based on Artificial Intelligence Technology;International Journal of Information Technology and Web Engineering;2024-05-24

3. Traffic flow prediction: A 3D adaptive multi‐module joint modeling approach integrating spatial‐temporal patterns to capture global features;Journal of Forecasting;2024-05-18

4. Motion sensitive network for action recognition in control and decision-making of autonomous systems;Frontiers in Neuroscience;2024-03-25

5. Multimodal fusion for audio-image and video action recognition;Neural Computing and Applications;2024-01-09