Abstract
In this paper, we study sign language recognition, focusing on isolated signs. The task is framed as a classification problem in which a sequence of frames (i.e., images) is assigned to one of a given set of sign language glosses. We analyze two appearance-based approaches, I3D and TimeSformer, and one pose-based approach, SPOTER. The appearance-based approaches are trained on several data modalities, whereas SPOTER is evaluated with different types of preprocessing. All methods are tested on two publicly available datasets: AUTSL and WLASL300. We experiment with ensemble techniques and achieve a new state-of-the-art accuracy of 73.84% on the WLASL300 dataset by using the CMA-ES optimization method to find the best ensemble weights. Furthermore, we present an ensembling technique based on the Transformer model, which we call Neural Ensembler.
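As a rough illustration of the weighted-ensemble idea mentioned above, the sketch below combines per-model class probabilities with a weight per model and searches for weights that maximize accuracy. It rests on several assumptions not stated in the abstract: that the ensemble is a weighted sum of class probabilities, that accuracy is the objective, and that a simple (1+lambda) evolution strategy is an acceptable stand-in for CMA-ES (it is not CMA-ES itself). All function names and the toy data are hypothetical.

```python
import random

def ensemble_predict(model_probs, weights):
    """Return the argmax class per sample of the weighted sum of model probabilities."""
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    preds = []
    for i in range(n_samples):
        scores = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=scores.__getitem__))
    return preds

def accuracy(model_probs, weights, labels):
    """Fraction of samples whose ensemble prediction matches the label."""
    preds = ensemble_predict(model_probs, weights)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def search_weights(model_probs, labels, iterations=200, population=8,
                   sigma=0.3, seed=0):
    """Simple (1+lambda) evolution strategy; a stand-in for CMA-ES (assumption)."""
    rng = random.Random(seed)
    best_w = [1.0] * len(model_probs)          # start from a uniform ensemble
    best_acc = accuracy(model_probs, best_w, labels)
    for _ in range(iterations):
        for _ in range(population):
            # Perturb the current best weights with Gaussian noise, clipped at zero.
            cand = [max(0.0, w + rng.gauss(0.0, sigma)) for w in best_w]
            acc = accuracy(model_probs, cand, labels)
            if acc > best_acc:                 # keep only strict improvements
                best_w, best_acc = cand, acc
    return best_w, best_acc

# Hypothetical toy data: two models, four samples, two classes.
labels = [0, 1, 0, 1]
model_a = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.45, 0.55]]
model_b = [[0.3, 0.7], [0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
probs = [model_a, model_b]

uniform_acc = accuracy(probs, [1.0, 1.0], labels)
best_w, best_acc = search_weights(probs, labels)
print("uniform accuracy:", uniform_acc)
print("tuned weights:", best_w, "accuracy:", best_acc)
```

In practice the weights would be tuned on a held-out validation split rather than on the test data, and a dedicated CMA-ES implementation would replace the simple search used here.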
Funder
European Regional Development Fund
Technology Agency of the Czech Republic
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry