Affiliation:
1. Harbin Institute of Technology Shenzhen, Shenzhen, China
Abstract
Voice cloning in text-to-speech (TTS) is the process of replicating the voice of a target speaker with limited data. Among various voice cloning techniques, this article focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains a challenging task. The key to zero-shot voice cloning is obtaining a speaker embedding for the target speaker. Previous works use a speaker encoder to extract a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this article proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multi-reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the multi-reference audios are selected based on sentence similarity. The proposed model applies mutual information minimization between the two embeddings to remove entangled information from each. Experiments on the public English dataset VCTK show that our method improves synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared with state-of-the-art reference-embedding learning methods, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method better preserves the speaker embedding across different languages. Sample outputs are available on the demo page.
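The disentanglement step described in the abstract, minimizing the mutual information between the speaker embedding and the content embedding, can be sketched as follows. This is a minimal illustration assuming a CLUB-style variational upper bound on mutual information (Cheng et al., 2020); the embedding dimensions, hidden size, estimator choice, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: minimizing an estimated mutual information (MI) upper bound
# between a speaker embedding and a content embedding, in the spirit of the
# CLUB estimator. Dimensions and weights are assumptions for illustration.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Contrastive log-ratio upper bound of I(speaker_emb; content_emb)."""

    def __init__(self, spk_dim: int = 256, cnt_dim: int = 256, hidden: int = 512):
        super().__init__()
        # Variational network q(content | speaker), parameterized as a Gaussian.
        self.mu_net = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, cnt_dim)
        )
        self.logvar_net = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, cnt_dim), nn.Tanh()
        )

    def log_likelihood(self, spk, cnt):
        # Gaussian log-likelihood of the content embedding given the speaker embedding
        # (constants dropped, as they cancel in the bound).
        mu, logvar = self.mu_net(spk), self.logvar_net(spk)
        return (-((cnt - mu) ** 2) / logvar.exp() - logvar).sum(dim=-1)

    def mi_upper_bound(self, spk, cnt):
        # Positive pairs: matched (speaker, content) samples from the same utterance.
        positive = self.log_likelihood(spk, cnt)
        # Negative pairs: content embeddings shuffled across the batch.
        negative = self.log_likelihood(spk, cnt[torch.randperm(cnt.size(0))])
        return (positive - negative).mean()


if __name__ == "__main__":
    estimator = CLUBEstimator()
    spk_emb = torch.randn(8, 256)   # from the speaker encoder (batch of 8)
    cnt_emb = torch.randn(8, 256)   # from the content encoder
    mi_loss = estimator.mi_upper_bound(spk_emb, cnt_emb)
    total_loss = 1.0 * mi_loss      # hypothetical weight; add TTS reconstruction losses here
    print(float(mi_loss))
```

In training, such a bound would typically be added to the FastSpeech2-style acoustic losses, so that minimizing the total objective both reconstructs speech and reduces the estimated mutual information between the speaker and content embeddings.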
Publisher
Association for Computing Machinery (ACM)