Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
Published: 2023-03-27
Volume: 13
Issue: 7
Page: 4237
ISSN: 2076-3417
Container-title: Applied Sciences
Short-container-title: Applied Sciences
Language: en
Author:
Liang Kailin 1, Liu Bin 1, Hu Yifan 1, Liu Rui 1, Bao Feilong 1, Gao Guanglai 1
Affiliation:
1. College of Computer Science, Inner Mongolia University, Hohhot 010031, China
Abstract
Low-resource text-to-speech (TTS) synthesis is a promising research direction. Mongolian, the official language of the Inner Mongolia Autonomous Region, is spoken by more than 10 million people worldwide, yet as a representative low-resource language it lacks open-source TTS datasets. We therefore release an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for related researchers. In this work, we invited three Mongolian announcers to record topic-rich speech; each announcer recorded 10 h of Mongolian speech, for 30 h in total. In addition, we built two baseline systems on state-of-the-art neural architectures: a multi-speaker FastSpeech 2 model with a HiFi-GAN vocoder, and a fully end-to-end multi-speaker VITS model. With the FastSpeech 2 + HiFi-GAN system, all three speakers scored 4.0 or higher on both naturalness and speaker similarity; with the VITS model, all three speakers scored 4.5 or higher on both metrics. The experimental results show that the published MnTTS2 dataset can be used to build robust multi-speaker Mongolian TTS models.
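The naturalness and speaker-similarity scores reported above are listener ratings aggregated as mean opinion scores (MOS). As a minimal sketch of how such scores are typically computed (the ratings below are hypothetical, not the paper's data, and the helper name `mos_with_ci` is invented for illustration):

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence half-width."""
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance (Bessel's correction) over the 5-point listener ratings.
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Hypothetical 5-point ratings for one speaker/system pair.
ratings = [5, 4, 4, 5, 4, 3, 5, 4, 4, 5]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")  # prints "MOS = 4.30 +/- 0.42"
```

In practice such means are computed per speaker and per system, which is how per-speaker thresholds like "4.0 or higher" can be reported.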
Funder
High-level Talents Introduction Project of Inner Mongolia University; Young Scientists Fund of the National Natural Science Foundation of China
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science