Affiliation:
1. Information Engineering University, Zhengzhou 450001, China
2. Handan Vocational College of Science and Technology, Handan 056000, China
3. Yunnan Minzu University, Kunming 650504, Yunnan, China
Abstract
With the aim of adapting a source Text to Speech (TTS) model to synthesize a personal voice by using a few speech samples from the target speaker, voice cloning provides a specific TTS service. Although the Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot allow for the voice information of the entire utterance. This affects the similarity of voice cloning. As a vocoder, WaveNet sacrifices speech generation speed. To balance the relationship between model parameters, inference speed, and voice quality, a voice cloning method based on improved HiFi-GAN has been proposed in this paper. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector that can characterize the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) The one-dimensional depth-wise separable convolution is used to replace all standard one-dimensional convolutions, significantly reducing the model parameters and increasing the inference speed. The improved HiFi-GAN model remarkably reduces the number of vocoder model parameters by about 68.58% and boosts the model’s inference speed. The inference speed on the GPU and CPU has increased by 11.84% and 30.99%, respectively. Voice quality has also been marginally improved as MOS increased by 0.13 and PESQ increased by 0.11. The improved HiFi-GAN model exhibits outstanding performance and remarkable compatibility in the voice cloning task. Combined with the x-vector embedding, the proposed model achieves the highest score of all the models and test sets.
Funder
National Natural Science Foundation of China
Subject
General Mathematics,General Medicine,General Neuroscience,General Computer Science
Reference31 articles.
1. Neural voice cloning with a few samples;S. Arik
2. High quality, lightweight and adaptable TTS using LPCNet;Z. Kons,2019
3. Deep speaker: an end-to-end neural speaker embedding system;C. Li,2017
4. Sample efficient adaptive text-to-speech;Y. Chen,2018
5. Boffin tts: few-shot speaker adaptation by bayesian optimization;H. B. Moss
Cited by
3 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献