JSUM: A Multitask Learning Speech Recognition Model for Jointly Supervised and Unsupervised Learning-Reference-Cited by-同舟云学术

JSUM: A Multitask Learning Speech Recognition Model for Jointly Supervised and Unsupervised Learning

Published:2023-04-22 Issue:9 Volume:13 Page:5239
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Yolwas Nurmemet¹²,Meng Weijing¹²^ORCID

Affiliation:

1. Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China

2. College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China

Abstract

In recent years, the end-to-end speech recognition model has emerged as a popular alternative to the traditional Deep Neural Network—Hidden Markov Model (DNN-HMM). This approach maps acoustic features directly onto text sequences via a single network architecture, significantly streamlining the model construction process. However, the training of end-to-end speech recognition models typically necessitates a significant quantity of supervised data to achieve good performance, which poses a challenge in low-resource conditions. The use of unsupervised representation significantly reduces this necessity. Recent research has focused on end-to-end techniques employing joint Connectionist Temporal Classification (CTC) and attention mechanisms, with some also concentrating on unsupervised presentation learning. This paper proposes a joint supervised and unsupervised multi-task learning model (JSUM). Our approach leverages the unsupervised pre-trained wav2vec 2.0 model as a shared encoder that integrates the joint CTC-Attention network and the generative adversarial network into a unified end-to-end architecture. Our method provides a new low-resource language speech recognition solution that optimally utilizes supervised and unsupervised datasets by combining CTC, attention, and generative adversarial losses. Furthermore, our proposed approach is suitable for both monolingual and cross-lingual scenarios.

Funder

National Natural Science Foundation of China—Research on Key Technologies of Speech Recognition of Chinese and Western Asian Languages under Resource Constraints

National Language Commission Key Project of China—Research on Speech Keyword Search Technology of Chinese and Western Asian Languages

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/13/9/5239/pdf

Reference37 articles.

1. Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., and Kumar, S. (2020, January 4–8). Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.

2. Moritz, N., Hori, T., and Le, J. (2020, January 4–8). Streaming automatic speech recognition with the transformer model. Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.

3. Kahn, J., Lee, A., and Hannun, A. (2020, January 4–8). Self-training for end-to-end speech recognition. Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.

4. Graves, A., and Jaitly, N. (2014, January 21–26). Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the International Conference on Machine Learning, Beijing, China.

5. Miao, Y., Gowayyed, M., and Metze, F. (2015, January 13–17). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA.

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Implementation of an Automatic Meeting Minute Generation System Using YAMNet with Speaker Identification and Keyword Prompts;Applied Sciences;2024-06-29

2. Fine-grained image emotion captioning based on Generative Adversarial Networks;Multimedia Tools and Applications;2024-03-08