Authors:
Ren Zeyu, Yolwas Nurmemet, Slamu Wushour, Cao Ronghe, Wang Huiru
Abstract
Unlike traditional models, the end-to-end (E2E) ASR model does not require additional linguistic resources such as a pronunciation dictionary: the system is built from a single neural network and achieves performance comparable to that of traditional methods. However, such models require massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become popular and have achieved good performance even under low-resource conditions, but they are rarely applied to Central Asian languages such as Turkish and Uzbek. We extend the datasets by adding noise to the original audio and applying speed perturbation. To improve the performance of an E2E speech recognition system for agglutinative languages, we propose a new feature extractor, MSPC, which uses convolution kernels of different sizes to extract and fuse features at different scales. The experimental results show that this structure is superior to VGGNet. In addition, the attention module is improved. By using the CTC objective function during training and a BERT model to initialize the language model in the decoding stage, the proposed method accelerates convergence and improves recognition accuracy. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other set decrease by 2.42% and 2.96%, respectively. We apply the model to the Common Voice Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is comparable to advanced E2E systems.
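The abstract describes MSPC only as a front end that uses convolution kernels of different sizes to extract and fuse features at different scales. The following is a minimal PyTorch sketch of that idea, assuming parallel convolution branches fused by channel concatenation and a 1x1 projection; the class name, kernel sizes, channel counts, and fusion scheme are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class MultiScaleConvExtractor(nn.Module):
    """Sketch of a multi-scale conv front end: parallel branches with
    different kernel sizes, fused by a 1x1 convolution."""

    def __init__(self, in_channels=1, branch_channels=32, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size; padding=k//2 keeps time/freq dims aligned.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, k, padding=k // 2),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(),
            )
            for k in kernel_sizes
        )
        # Fuse the concatenated branch outputs back to branch_channels.
        self.fuse = nn.Conv2d(branch_channels * len(kernel_sizes), branch_channels, 1)
        # Downsample in time and frequency, as VGG-style ASR front ends do.
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        # x: (batch, 1, time, mel_bins) log-mel spectrogram.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.pool(self.fuse(multi_scale))

# Example: 8 utterances, 200 frames, 80 mel bins -> output (8, 32, 100, 40).
extractor = MultiScaleConvExtractor()
features = extractor(torch.randn(8, 1, 200, 80))

For context, hybrid CTC/attention training of the kind the abstract names is conventionally a weighted multi-task objective, L = lambda * L_CTC + (1 - lambda) * L_attention, where lambda is a tunable interpolation weight; this is the standard formulation of such systems, not a detail confirmed by the abstract itself.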
Funder
National Natural Science Foundation of China
National Language Commission Key Project
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
10 articles.