Optimizing Uyghur Speech Synthesis by Combining Pretrained Cross-Lingual Model

Authors:

Lu Kexin¹, Huang Zhihua², Yin Mingming¹, Chen Ke¹

Affiliation:

1. School of Computer Science and Technology, Xinjiang University, Urumqi, China

2. School of Computer Science and Technology, Xinjiang University, Urumqi, China; Key Laboratory of Signal Detection and Processing in Xinjiang, Xinjiang University, Urumqi, China

Abstract

End-to-end speech synthesis has advanced considerably for languages with abundant corpus resources, but these gains have not yet carried over to languages with limited corpora. This paper presents a strategy that uses contextual encoding information to improve the naturalness of speech synthesized by FastSpeech2 under low-resource conditions. First, we use the cross-lingual model XLM-RoBERTa to extract contextual features, which serve as an auxiliary input to the mel-spectrogram decoder of FastSpeech2. Second, we refine the mel-spectrogram prediction module to mitigate the overfitting that FastSpeech2 suffers on small training sets: Conformer blocks replace the Transformer blocks in both the encoder and decoder so the model can attend to feature information at different levels and granularities. Additionally, we introduce a token-average mechanism that averages the frame-level pitch and energy features over each token. Experimental results show that pre-training on the LJ Speech dataset and fine-tuning on only a 10-minute paired Uyghur corpus yields satisfactory synthesized Uyghur speech. Compared with the baseline, the proposed method halves the character error rate and improves the mean opinion score by more than 0.6. Similar results were observed in Mandarin Chinese experiments.
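
To make two of the components above concrete, the following is a minimal sketch, not the authors' released implementation: it shows how per-token contextual features could be extracted from XLM-RoBERTa (here via the Hugging Face transformers library) for use as an auxiliary decoder input, and how a token-average mechanism could pool frame-level pitch or energy over each token. The model name, hidden sizes, projection layer, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# --- (1) Contextual features from the pretrained cross-lingual model ---
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
xlmr = AutoModel.from_pretrained("xlm-roberta-base")

def contextual_features(text: str) -> torch.Tensor:
    """Return per-token hidden states, shape (1, n_tokens, 768)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return xlmr(**inputs).last_hidden_state

# Hypothetical projection mapping XLM-R features to the decoder hidden size
# before they are combined with the variance-adapted encoder output.
proj = nn.Linear(768, 256)

# --- (2) Token-average mechanism for pitch/energy ---
def token_average(values: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average frame-level values (pitch or energy) over each token's frames.

    values:    (n_frames,) frame-level pitch or energy
    durations: (n_tokens,) frames per token, summing to n_frames
    returns:   (n_tokens,) token-level averages
    """
    out = torch.zeros_like(durations, dtype=values.dtype)
    start = 0
    for i, d in enumerate(durations.tolist()):
        if d > 0:
            out[i] = values[start:start + d].mean()
        start += d
    return out

# Example usage (hypothetical data):
# feats = proj(contextual_features("Some input sentence"))      # (1, n_tokens, 256)
# pitch_tok = token_average(frame_pitch, durations)             # (n_tokens,)
```

A design note: averaging pitch and energy at the token level rather than predicting them per frame reduces the number of targets the variance predictors must fit, which is one plausible reason such a mechanism helps when only minutes of paired data are available.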

Funder

Natural Science Foundation of Xinjiang Uygur Autonomous Region of China

National Key R&D Program of China

Publisher

Association for Computing Machinery (ACM)

