EMO-MoviNet: Enhancing Action Recognition in Videos with EvoNorm, Mish Activation, and Optimal Frame Selection for Efficient Mobile Deployment

Published:2023-09-27 Issue:19 Volume:23 Page:8106
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Hussain Tarique¹, Memon Zulfiqar Ali¹, Qureshi Rizwan¹, Alam Tanvir²

Affiliation:

1. Fast School of Computing, National University of Computer and Emerging Sciences, Karachi Campus, Karachi 75030, Pakistan

2. College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar

Abstract

The primary goal of this study is to develop a deep neural network for action recognition that improves accuracy while minimizing computational cost. To this end, we propose EMO-MoviNet-A2*, a modified architecture that integrates Evolving Normalization (EvoNorm), Mish activation, and optimal frame selection to improve the accuracy and efficiency of action recognition in videos. The asterisk indicates that the model also incorporates the stream buffer concept. The Mobile Video Network (MoviNet) belongs to the family of memory-efficient architectures discovered through Neural Architecture Search (NAS), which balances accuracy and efficiency by integrating spatial, temporal, and spatio-temporal operations. Our research applies the MoviNet model, pre-trained on the Kinetics dataset, to the UCF101 and HMDB51 datasets. On UCF101, we observed a generalization gap: the model performed better on the training set than on the test set. To address this issue, we replaced batch normalization with EvoNorm, which unifies normalization and activation functions. Key-frame selection was another area that required improvement, so we developed a novel technique called Optimal Frame Selection (OFS) to identify key frames within videos more effectively than random or dense frame selection. Combining OFS with the Mish nonlinearity yielded a 0.8–1% improvement in accuracy in our 20-class UCF101 experiment. The EMO-MoviNet-A2* model uses 86% fewer FLOPs and approximately 90% fewer parameters on the UCF101 dataset, at a cost of 1–2% accuracy. Additionally, it achieves 5–7% higher accuracy on the HMDB51 dataset while requiring seven times fewer FLOPs and ten times fewer parameters than the reference model, Motion-Augmented RGB Stream (MARS).
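
The two building blocks named in the abstract can be illustrated with a minimal sketch. It uses the standard definition of Mish, x · tanh(softplus(x)), and the S0 variant of EvoNorm, which merges normalization and a sigmoid-gated activation into one expression. The abstract does not say which EvoNorm variant the authors used, so the choice of S0, the one-dimensional "group" of activations, and the function names `mish` and `evonorm_s0` are assumptions made here for illustration only, not the paper's implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def evonorm_s0(group, gamma=1.0, beta=0.0, v=1.0, eps=1e-5):
    """EvoNorm-S0 applied to one group of activations (hypothetical 1-D helper).

    Computes x * sigmoid(v * x) / group_std(x) * gamma + beta, so the
    normalization and the nonlinearity happen in a single step.
    gamma, beta and v stand in for learnable parameters.
    """
    n = len(group)
    mean = sum(group) / n
    var = sum((x - mean) ** 2 for x in group) / n
    std = math.sqrt(var + eps)
    return [x * sigmoid(v * x) / std * gamma + beta for x in group]


if __name__ == "__main__":
    print([round(mish(x), 4) for x in (-2.0, 0.0, 1.0)])
    print([round(y, 4) for y in evonorm_s0([1.0, -1.0])])
```

In a real network both operations would act on tensors, with EvoNorm-S0 computing the standard deviation per sample over channel groups and spatial positions; the scalar version above only shows the arithmetic.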

Funder

Qatar National Library (QNL), Doha, Qatar

Hamad Bin Khalifa University, Qatar Foundation, Education City, Doha, Qatar

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/23/19/8106/pdf
