The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters

Author:

Kadyrbek Nurgali (1), Mansurova Madina (1), Shomanov Adai (2), Makharova Gaukhar (3)

Affiliation:

1. Department of AI & Big Data, Faculty of Information Technologies, Al-Farabi Kazakh National University, Al-Farabi Ave., 71, Almaty 050040, Kazakhstan

2. School of Engineering and Digital Sciences, Nazarbayev University, Kabanbai Batyr Ave., 53, Astana 010000, Kazakhstan

3. Department of Foreign Language, Faculty of Philology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan

Abstract

This study addresses the transcription of Kazakh speech under dynamically changing conditions. It discusses the phonetic structure of the Kazakh language, technical considerations in collecting a transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality transcribed audio corpus containing 554 h of data was collected; it provides statistics on letter and syllable frequencies as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus covers a universal vocabulary and serves as a valuable resource for developing speech-related modules. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To make the model more robust, filters initialized with character-level embeddings were introduced to reduce the dependence on precise positioning in feature maps. During training, convolutional filters for spectrograms and for character-level features were trained simultaneously. The proposed approach, which combines supervised and unsupervised learning methods, reduced the model size by 66.7% while maintaining comparable accuracy. Evaluation on the test set showed a 7.6% lower character error rate (CER) than existing models, demonstrating state-of-the-art performance. The proposed architecture can be deployed on resource-constrained platforms. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and to languages beyond Kazakh.
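For readers unfamiliar with the metric, the character error rate reported above is conventionally the character-level edit distance between the reference transcript and the recognized text, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `edit_distance` and `cer` and the sample strings are assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a five-character word.
    print(cer("сәлем", "салем"))
```

Because the distance is normalized by the reference length only, a hypothesis with many insertions can produce a CER above 1.0.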

Funder

Committee of Science of the Republic of Kazakhstan

Publisher

MDPI AG

Subject

Artificial Intelligence, Computer Science Applications, Information Systems, Management Information Systems
