MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Author:

Chen Yi Ting 1, Li Wanting 1, Tang Buzhou 1

Affiliation:

1. Harbin Institute of Technology Shenzhen, Shenzhen, China

Abstract

Voice cloning in text-to-speech (TTS) is the process of replicating the voice of a target speaker from limited data. Among the various voice cloning techniques, this article focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains challenging. The key step in zero-shot voice cloning is obtaining a speaker embedding for the target speaker. Previous works have used a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this article proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multi-reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the multi-reference audios are selected based on sentence similarity. The proposed model applies mutual information minimization to the two embeddings to remove entangled information within each embedding. Experiments on the public English dataset VCTK show that our method improves synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared with state-of-the-art methods that learn reference embeddings, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method is better at preserving the speaker embedding across different languages. Sample outputs are available on the demo page.
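The reference-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how sentence similarity is computed or how the selected references are combined, so everything below is an assumption made for illustration: sentences are represented by precomputed vectors, similarity is cosine similarity, the top-k most similar reference utterances are kept, and their speaker embeddings are averaged. The names `select_references` and `mean_embedding` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_references(
    target_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k candidates most similar to the target sentence.

    Assumption: each vector is a sentence embedding of an utterance's text,
    produced by some sentence encoder that is not specified here.
    """
    ranked = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine_similarity(target_vec, candidate_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average several per-utterance speaker embeddings into one vector.

    Assumption: simple averaging stands in for whatever aggregation the
    actual model uses over its multi-reference audios.
    """
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


# Toy usage: pick the 2 reference sentences closest to the target text.
target = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5], [-1.0, 0.0]]
chosen = select_references(target, candidates, k=2)
```

In this sketch, `chosen` holds the indices of the reference utterances whose audio would then be passed to the speaker encoder. The mutual information minimization step between the speaker and content embeddings is not shown, since the abstract gives no detail on the estimator used.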

Publisher

Association for Computing Machinery (ACM)
