Affiliations:
1. Department of Computer Engineering, Chosun University, Gwangju 61452, Republic of Korea
2. Glosori Inc., Gwangju 61472, Republic of Korea
Abstract
Speech synthesis is a technology that converts text into speech waveforms. With the development of deep learning, neural network-based speech synthesis has been researched in various fields, and the quality of synthesized speech has improved significantly. In particular, Grad-TTS, a speech synthesis model based on the denoising diffusion probabilistic model (DDPM), exhibits high performance in various domains, generates high-quality speech, and supports multi-speaker speech synthesis. However, it cannot synthesize speech for speakers unseen during training. This study therefore proposes an effective zero-shot multi-speaker speech synthesis model that improves on the Grad-TTS architecture. The proposed method obtains speaker information from a reference utterance using a pre-trained speaker recognition model. In addition, by perturbing the speaker information, the model can learn speaker characteristics beyond those present in the training dataset. To evaluate the performance of the proposed method, we measured the speaker encoder cosine similarity (SECS) and the mean opinion score (MOS). To assess synthesis performance in both the seen-speaker and unseen-speaker scenarios, the proposed model was compared with Grad-TTS, SC-GlowTTS, and YourTTS. The results demonstrate excellent synthesis performance for seen speakers and performance comparable to that of existing zero-shot multi-speaker speech synthesis models for unseen speakers.
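The SECS metric mentioned in the abstract is the cosine similarity between the speaker embeddings of the reference and synthesized utterances, as produced by a speaker encoder. A minimal sketch of the computation is below; the `secs` helper name and the toy vectors are illustrative only, and a real evaluation would feed both utterances through the pre-trained speaker recognition model to obtain the embeddings.

```python
import numpy as np

def secs(emb_ref, emb_syn):
    """Cosine similarity between two speaker embeddings.

    emb_ref: embedding of the reference (ground-truth) utterance.
    emb_syn: embedding of the synthesized utterance.
    Returns a value in [-1, 1]; higher means the synthesized voice
    is closer to the reference speaker.
    """
    emb_ref = np.asarray(emb_ref, dtype=float)
    emb_syn = np.asarray(emb_syn, dtype=float)
    return float(np.dot(emb_ref, emb_syn)
                 / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn)))

# Identical embeddings give a similarity of 1.0.
print(secs([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))  # → 1.0
```

In practice the embeddings come from the same encoder for both utterances, so SECS measures speaker similarity independently of the linguistic content.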
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
1 article.