MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module

Author:

Zhang Yi

Abstract

As a sub-field of video content analysis, action recognition, which aims to recognize human actions in videos, has received extensive attention in recent years. Unlike a single image, a video has a temporal dimension, so extracting spatio-temporal information from videos is of great significance for action recognition. In this paper, an efficient network that extracts spatio-temporal information at relatively low computational cost (dubbed MEST) is proposed. First, a motion encoder is developed to capture short-term motion cues between consecutive frames, followed by a channel-wise spatio-temporal module that models long-term feature information. Moreover, weight standardization is applied to the convolution layers that precede batch normalization layers to expedite training and facilitate convergence. Experiments are conducted on five public action recognition datasets, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared to other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost and network scale.
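The weight standardization mentioned in the abstract (Qiao et al.) re-parameterizes each convolution filter to have zero mean and unit variance over its input-channel and spatial dimensions before the convolution is applied, which smooths the loss landscape when combined with batch normalization. The following is a minimal numpy sketch of that normalization step only; it is an illustration of the general technique, not the authors' MEST implementation, and the function name and shapes are assumptions.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    # Weight standardization (illustrative, not the paper's code):
    # normalize each output filter of a conv kernel with shape
    # (out_channels, in_channels, kh, kw) to zero mean and unit std
    # over its (in_channels, kh, kw) dimensions.
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

rng = np.random.default_rng(0)
w = rng.normal(loc=0.3, scale=2.0, size=(8, 3, 3, 3))  # hypothetical 3x3 kernel
w_hat = standardize_weights(w)
```

In a network, this transform is applied to the kernel on every forward pass (before the convolution), so gradients flow through the standardization itself; the subsequent batch normalization layer then operates on already well-conditioned activations.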

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

References: 36 articles.

1. Learning from temporal gradient for semi-supervised action recognition;Xiao;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2022

2. Convolutional Two-Stream Network Fusion for Video Action Recognition;Feichtenhofer;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2016

3. PA3D: Pose-Action 3D Machine for Video Recognition;Yan;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR,2019

4. A Closer Look at Spatiotemporal Convolutions for Action Recognition;Tran;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR,2018

5. TSM: Temporal Shift Module for Efficient Video Understanding;Lin;Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV),2019

Cited by 6 articles.

1. Lightweight hybrid model based on MobileNet-v2 and Vision Transformer for human–robot interaction;Engineering Applications of Artificial Intelligence;2024-01

2. TransNet: A Transfer Learning-Based Network for Human Action Recognition;2023 International Conference on Machine Learning and Applications (ICMLA);2023-12-15

3. Exploring Approaches and Techniques for Human Activity Recognition in Video: A Comprehensive Overview;2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE);2023-11-02

4. Two-Level Attention Module Based on Spurious-3D Residual Networks for Human Action Recognition;Sensors;2023-02-03

5. WLiT: Windows and Linear Transformer for Video Action Recognition;Sensors;2023-02-02
