Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet

Author:

Ren Qing-Dao-Er-Ji 1, Wang Lele 1, Zhang Wenjing 1, Li Leixiao 2

Affiliation:

1. School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China

2. College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010051, China

Abstract

The core challenge of speech synthesis is converting text into audible speech that meets users' needs. In recent years, end-to-end models have significantly improved the quality of synthesized speech. However, because of the characteristics of the Mongolian language and the lack of an audio corpus, Mongolian speech synthesis has produced few results, and problems remain with both performance and synthesis quality. First, the phoneme information of Mongolian was further improved and a Bang-based pre-training model was constructed to reduce the error rate of synthesized Mongolian words. Second, a Mongolian speech synthesis model based on Ghost and ILPCnet, named the Ghost-ILPCnet model, was proposed. Its acoustic model improves on Para-WaveNet by replacing ordinary convolution blocks with stacked Ghost modules, which generate Mongolian acoustic features in parallel and speed up speech generation. Its improved vocoder, ILPCnet, offers high synthesis quality and low complexity compared with other vocoders. Finally, extensive experiments were conducted to verify the effectiveness of the proposed model. The experimental results show that the Ghost-ILPCnet model has a simple structure, fewer parameters, and lower hardware requirements, and can be trained in parallel. The mean opinion score (MOS) of its synthesized speech reached 4.48 and the real-time rate reached 0.0041. The model preserves the naturalness and clarity of synthesized speech, speeds up synthesis, and effectively improves the performance of Mongolian speech synthesis.
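To illustrate why swapping ordinary convolution blocks for Ghost modules can reduce the parameter count, the following is a minimal sketch that compares the weight counts of a 1-D convolution and a Ghost-style replacement. It assumes the general Ghost module formulation (a primary convolution producing a fraction of the output channels, followed by cheap depthwise filters that generate the rest). The abstract does not give the paper's layer configuration, so the function names, the channel counts, the kernel sizes, and the ratio of 2 are all hypothetical.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weight count of an ordinary 1-D convolution (bias ignored)."""
    return in_channels * out_channels * kernel_size


def ghost_conv1d_params(
    in_channels: int,
    out_channels: int,
    kernel_size: int,
    ratio: int = 2,
    cheap_kernel_size: int = 3,
) -> int:
    """Weight count of a Ghost-style replacement (bias ignored).

    A primary convolution produces out_channels / ratio intrinsic feature
    maps; each intrinsic map is then expanded by (ratio - 1) depthwise
    "cheap" filters to obtain the remaining output channels.
    """
    if out_channels % ratio != 0:
        raise ValueError("out_channels must be divisible by ratio")
    intrinsic = out_channels // ratio
    primary = in_channels * intrinsic * kernel_size
    cheap = intrinsic * (ratio - 1) * cheap_kernel_size
    return primary + cheap


# Hypothetical layer sizes, chosen only for illustration (not from the paper).
ordinary = conv1d_params(256, 256, 3)
ghost = ghost_conv1d_params(256, 256, 3, ratio=2, cheap_kernel_size=3)
print(ordinary, ghost, round(ordinary / ghost, 2))  # 196608 98688 1.99
```

Under these assumptions the Ghost-style layer needs roughly half the weights of the ordinary convolution, and the saving grows approximately in proportion to the chosen ratio.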

Funder

National Natural Science Foundation of China

Inner Mongolia Natural Science Foundation

Inner Mongolia Science and Technology Program Project

Support Program for Young Scientific and Technological Talents in Inner Mongolia Colleges and Universities

Fundamental Research Fund Project

Basic scientific research business expenses of universities directly in the Inner Mongolia Autonomous Region

Publisher

MDPI AG

