Affiliation:
1. School of Computer Science & Information Engineering, Shanghai Institute of Technology, Shanghai, China
Abstract
Vision-based Continuous Sign Language Recognition (CSLR) is a challenging and weakly supervised task aimed at segmenting sign language from weakly annotated image stream sequences for recognition. Compared with Isolated Sign Language Recognition (ISLR), the biggest challenge of this work is that the image stream sequences have ambiguous time boundaries. Recent CSLR works have shown that the visual-level sign language recognition task focuses on image stream feature extraction and feature alignment, and overfitting is the most critical problem in the CSLR training process. After investigating the advanced CSLR models in recent years, we have identified that the key to this study is the adequate training of the feature extractor. Therefore, this paper proposes a CSLR model with Multi-state Feature Optimization (MFO), which is based on Fully Convolutional Network (FCN) and Connectionist Temporal Classification (CTC). The MFO mechanism supervises the multiple states of each Sign Gloss in the modeling process and provides more refined labels for training the CTC decoder, which can effectively solve the overfitting problem caused by training, while also significantly reducing the training cost in time. We validate the MFO method on the popular CSLR dataset and demonstrate that the model has better performance.
Subject
Artificial Intelligence,General Engineering,Statistics and Probability