Research of the models for sign gesture recognition using 3D convolutional neural networks and visual transformers

Published:2023 Issue:2 Volume:5 Page:33-40
ISSN:2707-1898
Container-title:Ukrainian Journal of Information Technology
Short-container-title:UJIT

Author:

Chornenkyi V. Ya., Kazymyra I. Ya.

Abstract

This work addresses the problem of hand gesture recognition, motivated by three goals: improving military training methods, enhancing human-machine interaction, and facilitating communication between people with disabilities and machines. Methods for hand gesture recognition are analyzed, covering both established computer vision approaches and current deep learning trends. The study examines the fundamental principles behind the design of models based on 3D convolutional neural networks (3D-CNN) and visual transformers (ViT). The analyzed 3D-CNN architecture is a convolutional neural network with two convolutional layers and two pooling layers. Each 3D convolution is obtained by convolving a 3D filter kernel with a 3D cube formed by stacking multiple adjacent frames. The visual transformer architecture considered consists of a Linear Projection and a Transformer Encoder with two sub-layers: the Multi-head Self-Attention (MSA) layer and the feedforward layer, also known as the Multi-Layer Perceptron (MLP). The models are trained on the ASL and NUS-II datasets, which contain a diverse set of sign language images. Model performance is assessed after 20 training epochs using recall, precision, and the F1 score. Additionally, the study compares the performance of the ViT architecture after 20 and 40 training epochs. The analysis identifies the scenarios in which 3D convolutional neural networks and visual transformers each achieve higher accuracy. It also highlights the limitations of each approach under varying environmental conditions and computational resources. The research identifies deep-learning-based architectures for hand gesture recognition that are promising for further study and for integration into software products.
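
The 3D convolution mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `conv3d_valid`, the toy clip of four 5x5 frames, and the 3x3x3 kernel are all assumptions chosen for illustration, and the loop computes the unflipped (cross-correlation) form that deep learning libraries usually call convolution.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution of one single-channel clip with one kernel.

    clip:   nested list indexed [frame][row][col], i.e. stacked adjacent frames
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):              # slide along the time axis
        plane = []
        for y in range(H - kh + 1):          # slide along image rows
            row = []
            for x in range(W - kw + 1):      # slide along image columns
                acc = 0.0
                # Multiply the kernel with the kt x kh x kw cube of pixels
                # taken from kt adjacent frames, then sum the products.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy data: 4 frames of 5x5 pixels and a 3x3x3 kernel, all ones.
frames = [[[1.0] * 5 for _ in range(5)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
features = conv3d_valid(frames, kernel)
print(len(features), len(features[0]), len(features[0][0]))
```

Because the kernel spans several frames at once, each output value depends on motion across time as well as on spatial appearance, which is the property that makes 3D convolutions suitable for gesture sequences.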

Publisher

Lviv Polytechnic National University

References (17 articles)

1. Molchanov, P., Gupta, S., Kim, K., & Kautz, J. (2015). Hand gesture recognition with 3D convolutional neural networks. http://dx.doi.org/10.1109/CVPRW.2015.7301342

2. Molchanov, P., Gupta, S., Kim, K., & Pulli, K. (2015). Multi-sensor system for driver's hand-gesture recognition. 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 1, 1-8. https://doi.org/10.1109/FG.2015.7163132

3. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 223, 1725-1732. https://doi.org/10.1109/CVPR.2014.223

4. Ohn-Bar, E., & Trivedi, M. M. (2014). Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations. IEEE Transactions on Intelligent Transportation Systems, 15, 2368-2377. https://doi.org/10.1109/TITS.2014.2337331

5. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition. https://doi.org/10.48550/arXiv.1406.2199