Affiliation:
1. Rochester Institute of Technology, Rochester, New York
Abstract
Many sign languages are bona fide natural languages with grammatical rules and lexicons, and hence can benefit from machine translation methods. Similarly, since sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning methods in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning). Researchers have therefore begun extending these learning methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality where meaning is often derived from context.
The focus of this article, therefore, is to examine various deep learning–based methods for encoding sign language as inputs, and to analyze the efficacy of several machine translation methods over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information.
To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints and facial points, as well as vector representations/embeddings from convolutional neural networks. The machine translation models explored range from several baseline sequence-to-sequence approaches to more complex networks using attention, reinforcement learning, and the transformer architecture. We implement the translation methods over multiple sign languages: German (GSL), American (ASL), and Chinese (CSL). From our analysis, the transformer model combined with input embeddings from ResNet50 or with pose-based landmark features outperformed all the other sequence-to-sequence models, achieving higher BLEU-2 through BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the less controlled ASL and CSL datasets.
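As a rough illustration of the best-performing combination described above, the sketch below pairs per-frame ResNet50 embeddings with a standard transformer encoder-decoder that decodes spoken-language tokens directly, with no gloss supervision. It is a minimal sketch assuming PyTorch and torchvision; the model sizes, vocabulary, class name SignTranslationTransformer, and toy inputs are illustrative assumptions rather than the authors' exact configuration.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class SignTranslationTransformer(nn.Module):
        """Frame-level ResNet50 features -> transformer encoder-decoder -> text tokens."""

        def __init__(self, vocab_size=1000, d_model=512, nhead=8, num_layers=3):
            super().__init__()
            backbone = resnet50(weights=None)           # pretrained weights would be used in practice
            backbone.fc = nn.Identity()                 # keep the 2048-d pooled feature per frame
            self.backbone = backbone
            self.frame_proj = nn.Linear(2048, d_model)  # project frame features to model width
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                batch_first=True,
            )
            self.out = nn.Linear(d_model, vocab_size)   # logits over the spoken-language vocabulary

        def forward(self, frames, target_tokens):
            # frames: (batch, time, 3, 224, 224); target_tokens: (batch, tgt_len)
            b, t = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # (b, t, 2048)
            src = self.frame_proj(feats)                                  # encoder input
            tgt = self.token_emb(target_tokens)                           # decoder input
            causal = self.transformer.generate_square_subsequent_mask(target_tokens.size(1))
            hidden = self.transformer(src, tgt, tgt_mask=causal)
            return self.out(hidden)                                       # (b, tgt_len, vocab_size)

    # Toy forward pass: two clips of 16 frames each, 12 target tokens per sentence.
    model = SignTranslationTransformer()
    frames = torch.randn(2, 16, 3, 224, 224)
    tokens = torch.randint(0, 1000, (2, 12))
    print(model(frames, tokens).shape)  # torch.Size([2, 12, 1000])

In the experiments summarized above, the transformer component is compared against the other sequence-to-sequence variants, and the ResNet50 embeddings against pose-based landmark features, with BLEU-2 through BLEU-4 computed on the decoded sentences.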
Funder
National Science Foundation
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications, Human-Computer Interaction
References
113 articles.
Cited by
19 articles.