MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module

Author:

Zhang Yi

Abstract

As a sub-field of video content analysis, action recognition, which aims to recognize human actions in videos, has received extensive attention in recent years. Unlike a single image, a video has a temporal dimension, so extracting spatio-temporal information from videos is of great significance for action recognition. In this paper, an efficient network that extracts spatio-temporal information at relatively low computational cost (dubbed MEST) is proposed. First, a motion encoder is developed to capture short-term motion cues between consecutive frames, followed by a channel-wise spatio-temporal module that models long-term feature information. Moreover, weight standardization is applied to the convolution layers that precede batch normalization layers to expedite training and facilitate convergence. Experiments are conducted on five public action recognition datasets, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared to other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost and network scale.
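The weight standardization mentioned in the abstract (Qiao et al.) re-parameterizes each convolution filter to have zero mean and unit variance over its input-channel and spatial dimensions before the convolution is applied, which smooths the loss landscape when combined with batch normalization. The following is a minimal numpy sketch of that normalization step only; it is an illustration of the general technique, not the authors' MEST implementation, and the function name and shapes are assumptions.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    # Weight standardization (illustrative, not the paper's code):
    # normalize each output filter of a conv kernel with shape
    # (out_channels, in_channels, kh, kw) to zero mean and unit std
    # over its (in_channels, kh, kw) dimensions.
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

rng = np.random.default_rng(0)
w = rng.normal(loc=0.3, scale=2.0, size=(8, 3, 3, 3))  # hypothetical 3x3 kernel
w_hat = standardize_weights(w)
```

In a network, this transform is applied to the kernel on every forward pass (before the convolution), so gradients flow through the standardization itself; the subsequent batch normalization layer then operates on already well-conditioned activations.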

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

References: 36 articles.

1. Learning from temporal gradient for semi-supervised action recognition;Xiao;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2022

2. Convolutional Two-Stream Network Fusion for Video Action Recognition;Feichtenhofer;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2016

3. PA3D: Pose-Action 3D Machine for Video Recognition;Yan;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR,2019

4. A Closer Look at Spatiotemporal Convolutions for Action Recognition;Tran;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR,2018

5. TSM: Temporal Shift Module for Efficient Video Understanding;Lin;Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV),2019

Cited by 6 articles.

1. Lightweight hybrid model based on MobileNet-v2 and Vision Transformer for human–robot interaction;Engineering Applications of Artificial Intelligence;2024-01

2. TransNet: A Transfer Learning-Based Network for Human Action Recognition;2023 International Conference on Machine Learning and Applications (ICMLA);2023-12-15

3. Exploring Approaches and Techniques for Human Activity Recognition in Video: A Comprehensive Overview;2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE);2023-11-02

4. Two-Level Attention Module Based on Spurious-3D Residual Networks for Human Action Recognition;Sensors;2023-02-03

5. WLiT: Windows and Linear Transformer for Video Action Recognition;Sensors;2023-02-02
