Nonlinear Regularization Decoding Method for Speech Recognition

Published:2024-06-14 Issue:12 Volume:24 Page:3846
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Zhang Jiang¹, Wang Liejun¹, Yu Yinfeng¹, Xu Miaomiao¹

Affiliation:

1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

Abstract

Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and poorly suited to small datasets. We therefore propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder that breaks away from the traditional left-to-right or right-to-left decoding order and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module that optimizes the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the problem of excessively large parameter counts. Experimental results show that, compared to the baseline, our model achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
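For readers unfamiliar with the hybrid CTC and Transformer decoders that the abstract takes as its starting point, the following is a minimal sketch of how such decoders commonly combine the two branches: each candidate transcript receives a weighted sum of its CTC log-probability and its attention-decoder log-probability. This illustrates the conventional hybrid scoring only, not the method proposed in the paper. The names `Hypothesis`, `joint_score`, and `rescore`, and the default weight of 0.3, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate transcript with a score from each decoder branch."""
    text: str
    ctc_logp: float  # log-probability assigned by the CTC branch
    att_logp: float  # log-probability assigned by the attention decoder


def joint_score(hyp: Hypothesis, ctc_weight: float = 0.3) -> float:
    """Interpolate the two log-probabilities (ctc_weight is an assumed value)."""
    return ctc_weight * hyp.ctc_logp + (1.0 - ctc_weight) * hyp.att_logp


def rescore(hyps: list[Hypothesis], ctc_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score."""
    return max(hyps, key=lambda h: joint_score(h, ctc_weight))


# Toy candidates with made-up scores, purely for demonstration.
candidates = [
    Hypothesis("a", ctc_logp=-1.0, att_logp=-5.0),
    Hypothesis("b", ctc_logp=-2.0, att_logp=-3.0),
]
best = rescore(candidates)
```

Because both branches score the same hypotheses, a mistake that one branch commits early can still dominate the final ranking, which is the kind of error accumulation the abstract refers to.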

Funder

Tianshan Excellence Program Project of Xinjiang Uygur Autonomous Region

Central Government Guides Local Science and Technology Development Fund Projects

Graduate Research Innovation Project of Xinjiang Uygur Autonomous Region

Publisher

MDPI AG

Link

https://www.mdpi.com/1424-8220/24/12/3846/pdf

References (44 articles)

1. Ryumin (2024). Audio-visual speech recognition based on regulated transformer and spatio-temporal fusion strategy for driver assistive systems. Expert Syst. Appl.

2. Ryumin, D., Ivanko, D., and Ryumina, E. (2023). Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors, 23.

3. Miao, Z., Liu, H., and Yang, B. (2020, January 11–14). Part-based lipreading for audio-visual speech recognition. Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada.

4. Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE.

5. Juang (1991). Hidden Markov models for speech recognition. Technometrics.