Affiliation:
1. School of Computer Science and Technology, Xinjiang University, Urumqi, China
2. School of Computer Science and Technology, Xinjiang University, Urumqi, China and Key Laboratory of Signal Detection and Processing in Xinjiang, Xinjiang University, Urumqi, China
Abstract
End-to-end speech synthesis has advanced considerably for languages with abundant corpus resources, but these gains have not yet carried over to languages with limited corpora. This paper presents a strategy that uses contextual encoding information to improve the naturalness of speech synthesized by FastSpeech2 under low-resource conditions. First, we use the cross-lingual model XLM-RoBERTa to extract contextual features, which serve as an auxiliary input to the mel-spectrogram decoder of FastSpeech2. Second, we modify the mel-spectrogram prediction module to mitigate the overfitting FastSpeech2 suffers on small training sets: Conformer blocks replace the Transformer blocks in both the encoder and decoder so that the model attends to feature information at multiple levels and granularities, and a token-average mechanism smooths frame-level pitch and energy attributes. Experiments show that pre-training on the LJ Speech dataset and fine-tuning on only 10 minutes of paired Uyghur data yields satisfactory synthesized Uyghur speech. Compared with the baseline, the proposed method halves the character error rate and improves the mean opinion score by more than 0.6. Similar results were observed in Mandarin Chinese experiments.
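The token-average mechanism mentioned in the abstract can be sketched as follows: frame-level pitch or energy values are averaged over the frames assigned to each input token (using the predicted durations), so the variance adaptor works with one value per token rather than per frame. This is a minimal illustrative sketch; the function name, shapes, and use of NumPy are assumptions, not the authors' implementation.

```python
import numpy as np

def token_average(frame_values: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Average frame-level values (e.g. pitch or energy) per token.

    frame_values: shape (total_frames,) -- one value per mel frame
    durations:    shape (num_tokens,)  -- frames per token, summing to total_frames
    Returns an array of shape (num_tokens,) with the mean value per token.
    """
    averaged = np.zeros(len(durations), dtype=frame_values.dtype)
    start = 0
    for i, d in enumerate(durations):
        if d > 0:
            # Mean of the frames belonging to token i
            averaged[i] = frame_values[start:start + d].mean()
        start += d
    return averaged

# Hypothetical usage: 6 frames of pitch, spread over 3 tokens
pitch = np.array([100.0, 110.0, 120.0, 200.0, 210.0, 0.0])
durations = np.array([3, 2, 1])
print(token_average(pitch, durations))  # → [110. 205.   0.]
```

In training, the token-level averages replace the raw frame-level targets of the pitch and energy predictors, which reduces the variance the model must fit from a tiny corpus.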
Funder
Natural Science Foundation of Xinjiang Uygur Autonomous Region of China
National Key R&D Program of China
Publisher
Association for Computing Machinery (ACM)