Author:
Liu Fangkun, Wang Hui, Peng Renhua, Zheng Chengshi, Li Xiaodong
Abstract
Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic owing to its potentially wide range of applications: it can convert the voice of any source speaker to that of any target speaker, even when both speakers are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net can extract both local features and multi-scale features of the log-mel spectrogram, which helps to learn the time-frequency structures of the source speech and the target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, retaining more content information of the source speech while maintaining speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.
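For readers unfamiliar with the normalization family the abstract refers to, the sketch below shows plain adaptive instance normalization (AdaIN), the operation that SaAdaIN builds on: content features are normalized per channel over time to strip the source speaker's statistics, AdaIN(x) = sigma_y * (x - mu_x) / sigma_x + mu_y, and then re-scaled with the target speaker's statistics. This is a minimal PyTorch sketch under stated assumptions, not the paper's SaAdaIN layer; the function name, tensor shapes, and the separate speaker_mean/speaker_std inputs are illustrative and do not come from the paper.

    import torch

    def adaptive_instance_norm(content, speaker_mean, speaker_std, eps=1e-5):
        # content:      (batch, channels, frames) encoder output (assumed shape)
        # speaker_mean: (batch, channels, 1) target speaker channel-wise mean
        # speaker_std:  (batch, channels, 1) target speaker channel-wise std
        c_mean = content.mean(dim=-1, keepdim=True)
        c_std = content.std(dim=-1, keepdim=True) + eps
        # instance-normalize over time: removes source speaker statistics
        normalized = (content - c_mean) / c_std
        # re-scale with target speaker statistics: injects the target identity
        return speaker_std * normalized + speaker_mean

The "sandwich" variant described in the paper wraps additional normalization around this core affine injection; the above only illustrates the statistic-swapping mechanism shared by AdaIN-VC, AGAIN-VC, and U2-VC.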
Funder
National Natural Science Foundation of China
National Key R&D Program of China
Publisher
Springer Science and Business Media LLC
Subject
Electrical and Electronic Engineering, Acoustics and Ultrasonics
Cited by
4 articles.