Affiliations:
1. Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea
2. Glosori Inc., Gwangju 61472, Republic of Korea
Abstract
Speech synthesis is a technology that converts text into speech waveforms. With the development of deep learning, neural network-based speech synthesis has been researched in various fields, and the quality of synthesized speech has improved significantly. In particular, Grad-TTS, a speech synthesis model based on the denoising diffusion probabilistic model (DDPM), exhibits high performance in various domains, generates high-quality speech, and supports multi-speaker speech synthesis. However, it cannot synthesize speech for speakers unseen during training. This study therefore proposes an effective zero-shot multi-speaker speech synthesis model that improves on the Grad-TTS architecture. The proposed method obtains speaker information from a reference utterance using a pre-trained speaker recognition model. In addition, by perturbing the speaker information, the model can learn speaker characteristics beyond those present in the training dataset. To evaluate the performance of the proposed method, we measured the speaker encoder cosine similarity (SECS) and the mean opinion score (MOS). To assess synthesis performance in both the seen-speaker and unseen-speaker scenarios, the proposed model was compared with Grad-TTS, SC-GlowTTS, and YourTTS. The results demonstrate excellent synthesis performance for seen speakers and performance comparable to that of existing zero-shot multi-speaker speech synthesis models for unseen speakers.
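The SECS metric mentioned in the abstract is the cosine similarity between the speaker embeddings of the reference and synthesized utterances, as produced by a speaker encoder. A minimal sketch of the computation is below; the `secs` helper name and the toy vectors are illustrative only, and a real evaluation would feed both utterances through the pre-trained speaker recognition model to obtain the embeddings.

```python
import numpy as np

def secs(emb_ref, emb_syn):
    """Cosine similarity between two speaker embeddings.

    emb_ref: embedding of the reference (ground-truth) utterance.
    emb_syn: embedding of the synthesized utterance.
    Returns a value in [-1, 1]; higher means the synthesized voice
    is closer to the reference speaker.
    """
    emb_ref = np.asarray(emb_ref, dtype=float)
    emb_syn = np.asarray(emb_syn, dtype=float)
    return float(np.dot(emb_ref, emb_syn)
                 / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn)))

# Identical embeddings give a similarity of 1.0.
print(secs([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))  # → 1.0
```

In practice the embeddings come from the same encoder for both utterances, so SECS measures speaker similarity independently of the linguistic content.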
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
1 article.