A Performance Comparison of Japanese Sign Language Recognition with ViT and CNN Using Angular Features
Published: 2024-04-11
Volume: 14
Issue: 8
Page: 3228
ISSN: 2076-3417
Container-title: Applied Sciences
Short-container-title: Applied Sciences
Language: en
Author:
Kondo Tamon 1 (ORCID), Narumi Sakura 1, He Zixun 2 (ORCID), Shin Duk 2 (ORCID), Kang Yousun 2
Affiliation:
1. Graduate School of Engineering, Tokyo Polytechnic University, Atsugi 243-0218, Kanagawa, Japan
2. Faculty of Engineering, Tokyo Polytechnic University, Atsugi 243-0218, Kanagawa, Japan
Abstract
In recent years, developments in deep learning have driven significant advances in research aimed at facilitating communication with individuals who have hearing impairments, with a particular focus on automatic sign language recognition and translation systems. This study proposes a novel approach that uses a vision transformer (ViT) to recognize Japanese Sign Language. Our method employs the pose estimation library MediaPipe to extract the positional coordinates of each finger joint in every video frame and generates one-dimensional angular features from these coordinates. These feature vectors are then arranged in temporal order to form a two-dimensional input for the ViT model. To determine the optimal configuration, this study evaluated recognition accuracy while varying the number of encoder layers in the ViT model and compared the results against traditional convolutional neural network (CNN) models. The ViT model achieved 99.7% accuracy, compared with 99.3% for the CNN. We further demonstrated the practicality of our approach through real-time recognition experiments on Japanese Sign Language videos.
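As a rough illustration of the pipeline the abstract describes, the Python sketch below extracts hand landmarks with MediaPipe Hands, converts each frame's 21 joint coordinates into a one-dimensional vector of inter-joint angles, and stacks the per-frame vectors into a two-dimensional (time x angle) array of the kind fed to the ViT. The triplet choices, angle definition, and helper names are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch, assuming MediaPipe Hands and NumPy; not the paper's exact feature code.
import numpy as np
import mediapipe as mp

# Landmark index triplets (parent, joint, child) whose interior angle we measure.
# MediaPipe Hands numbers 21 landmarks: 0 = wrist, then 4 per finger.
JOINT_TRIPLETS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4),          # thumb
    (0, 5, 6), (5, 6, 7), (6, 7, 8),          # index
    (0, 9, 10), (9, 10, 11), (10, 11, 12),    # middle
    (0, 13, 14), (13, 14, 15), (14, 15, 16),  # ring
    (0, 17, 18), (17, 18, 19), (18, 19, 20),  # little
]

def joint_angles(landmarks):
    """Return one angle (radians) per triplet from 21 (x, y, z) landmarks."""
    pts = np.array([(lm.x, lm.y, lm.z) for lm in landmarks])
    angles = []
    for a, b, c in JOINT_TRIPLETS:
        v1, v2 = pts[a] - pts[b], pts[c] - pts[b]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

def video_to_feature_map(frames):
    """Stack per-frame angle vectors into a 2D (time x angles) input array."""
    rows = []
    with mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        for rgb in frames:  # each frame: an RGB uint8 array of shape (H, W, 3)
            result = hands.process(rgb)
            if result.multi_hand_landmarks:
                rows.append(joint_angles(result.multi_hand_landmarks[0].landmark))
    return np.stack(rows) if rows else np.empty((0, len(JOINT_TRIPLETS)))

Each row of the returned array is one frame's angular feature vector, so the full array plays the role of the two-dimensional input image that the ViT (or the baseline CNN) consumes.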
Funder
Co-G.E.I. (Cooperative Good Educational Innovation) Challenge 2023 of Tokyo Polytechnic University