Semi-Supervised Learning for Robust Emotional Speech Synthesis with Limited Data
Published: 2023-05-06
Issue: 9
Volume: 13
Page: 5724
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Zhang Jialin 1, Wushouer Mairidan 1, Tuerhong Gulanbaier 1, Wang Hanfang 1
Affiliation:
1. Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
Abstract
Emotional speech synthesis is an important branch of human–computer interaction technology that aims to generate emotionally expressive and intelligible speech from input text. With the rapid development of deep-learning-based speech synthesis, emotional speech synthesis has gradually attracted the attention of researchers. However, owing to the scarcity of high-quality emotional speech corpora, emotional speech synthesis under low-resource conditions is prone to overfitting, exposure bias, catastrophic forgetting, and other problems that degrade the generated speech. In this paper, we propose an emotional speech synthesis method that integrates transfer learning, semi-supervised training, and a robust attention mechanism to better adapt to the emotional style of the speech data during fine-tuning. By adopting an appropriate fine-tuning strategy, a trade-off parameter configuration, and pseudo-labels expressed as loss terms, we efficiently guide the regularized learning of emotional speech synthesis. The proposed SMAL-ET2 method outperforms the baseline methods in both subjective and objective evaluations. The results demonstrate that our training strategy, which combines stepwise monotonic attention with a semi-supervised loss, alleviates overfitting and improves the generalization ability of the text-to-speech model. Our method also enables the model to synthesize different categories of emotional speech with better naturalness and emotion similarity.
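The abstract describes weighting a pseudo-label term against the supervised objective via a trade-off parameter. As a minimal illustrative sketch (not the paper's actual implementation; the function and parameter names `semi_supervised_loss` and `lam` are assumptions), the combined fine-tuning objective might look like:

```python
def semi_supervised_loss(recon_loss: float,
                         pseudo_label_loss: float,
                         lam: float = 0.1) -> float:
    """Combine the supervised reconstruction loss (e.g., mel-spectrogram
    error) with a pseudo-label regularization term; the trade-off
    parameter `lam` balances the two objectives."""
    return recon_loss + lam * pseudo_label_loss


# Example: a larger lam shifts weight toward the pseudo-label term.
total = semi_supervised_loss(recon_loss=2.0, pseudo_label_loss=0.5, lam=0.2)
```

In this form, `lam` is the trade-off parameter the abstract refers to: setting it too high lets noisy pseudo-labels dominate fine-tuning, while setting it too low loses the regularizing effect on the low-resource emotional data.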
Funder
National Natural Science Foundation of China; Natural Science Foundation of Autonomous Region; Autonomous Region High-Level Innovative Talent Project
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
References: 41 articles.
Cited by
2 articles.