
Tibetan Speech Synthesis Based on Pre-Traind Mixture Alignment FastSpeech2

Published: 2024-08-05 Issue: 15 Volume: 14 Page: 6834
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Zhou Qing¹², Xu Xiaona¹², Zhao Yue¹²

Affiliation:

1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100081, China

2. School of Information Engineering, Minzu University of China, Beijing 100081, China

Abstract

Most current research on Tibetan speech synthesis relies on autoregressive deep learning models. However, these models suffer from slow inference, skipped words, and repetitions. To overcome these issues, we propose an enhanced non-autoregressive acoustic model combined with a vocoder for Tibetan speech synthesis. Specifically, we introduce the mixture alignment FastSpeech2 method to correct errors caused by hard alignment in the original FastSpeech2 method. The new method employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, improving alignment accuracy between text and speech and enhancing the naturalness and intelligibility of the synthesized speech. We also integrate pitch and energy information into the model, further enhancing overall synthesis quality. Furthermore, Tibetan has smaller text-to-audio datasets than widely studied languages. To address this resource limitation, we use transfer learning: the model is first pre-trained on data from resource-rich languages, and the pre-trained mixture alignment FastSpeech2 model is then fine-tuned for Tibetan speech synthesis. Experimental results show that the mixture alignment FastSpeech2 model produces higher-quality speech than the original FastSpeech2 model, and that pre-training on an English dataset yields further improvements in clarity and naturalness.
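As a rough illustration of the character-level hard alignment mentioned in the abstract, the sketch below shows a FastSpeech-style length regulator that repeats each character-level state for its duration in frames. It is not the authors' implementation. The function names, the example durations, and the idea of obtaining a character's duration by summing the durations of its Latin letters are all assumptions made for illustration, and the letter-level soft alignment is omitted entirely.

```python
from typing import List, Sequence


def char_durations(letter_durations: Sequence[int],
                   letters_per_char: Sequence[int]) -> List[int]:
    """Sum letter-level frame counts into one duration per character.

    Assumption: a character's duration is the sum of its letters' durations.
    """
    if sum(letters_per_char) != len(letter_durations):
        raise ValueError("letters_per_char must account for every letter")
    totals, start = [], 0
    for n in letters_per_char:
        totals.append(sum(letter_durations[start:start + n]))
        start += n
    return totals


def length_regulate(char_states: Sequence[str],
                    durations: Sequence[int]) -> List[str]:
    """Hard alignment: repeat each character-level state once per frame."""
    if len(char_states) != len(durations):
        raise ValueError("need exactly one duration per character")
    frames: List[str] = []
    for state, d in zip(char_states, durations):
        frames.extend([state] * d)
    return frames


# Hypothetical input: two Tibetan characters romanized as "bod" and "skad",
# with made-up per-letter durations (in mel frames).
letter_durs = [3, 5, 2, 2, 3, 6, 4]   # b, o, d, s, k, a, d
per_char = [3, 4]                     # "bod" has 3 letters, "skad" has 4

durs = char_durations(letter_durs, per_char)
frames = length_regulate(["bod", "skad"], durs)
print(durs, len(frames))
```

In a real model the states would be encoder hidden vectors rather than strings, and the durations would come from a trained duration predictor.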

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/15/6834/pdf
