Disentangling Content Information by Combining ASR and TTS Bottleneck Features for Voice Conversion

Published: 2023-03 Issue: 01 Volume: 33
ISSN: 2717-5545
Container-title: International Journal of Asian Language Processing
Language: en
Short-container-title: Int. J. As. Lang. Proc.

Author:

Zhao Zeqing¹, Ma Sifan¹, Jia Yan¹, Hou Jingyu¹, Yang Lin¹, Wang Junjie¹

Affiliation:

1. AI Lab, Lenovo Research, Haidian District, Beijing 100094, P. R. China

Abstract

With the development of deep learning, nonparallel voice conversion (VC) has made significant progress in recent years. The two mainstream approaches in VC research leverage knowledge from automatic speech recognition (ASR) and from text-to-speech (TTS). In this paper, we demonstrate that the bottleneck features (BNFs) used by these two approaches are complementary. ASR-BNFs are more robust, especially in any-to-many tasks, but leak the source speaker's timbre information; TTS-BNFs are less likely to reveal the speaker's timbre information, but lack robustness. We therefore propose a nonparallel any-to-many voice conversion model that combines ASR-BNFs and TTS-BNFs. All modules of the proposed model can be trained jointly, without any pre-trained models. Experiments are conducted on a private multi-speaker TTS dataset. The results show that the proposed model achieves the best balance of speech quality, timbre similarity, and robustness compared with the baseline models.
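The abstract does not say how the two kinds of bottleneck features are combined. As a rough illustration only, the sketch below assumes one simple possibility: the ASR-BNF and TTS-BNF sequences are frame-aligned, concatenated frame by frame, and passed through a linear projection. The function name `combine_bnfs`, the toy dimensions, and the projection weights are all hypothetical and are not taken from the paper.

```python
def combine_bnfs(asr_bnf, tts_bnf, weights):
    """Concatenate two frame-aligned feature sequences and project them.

    asr_bnf: list of T frames, each a list of D_asr floats
    tts_bnf: list of T frames, each a list of D_tts floats
    weights: D_out rows, each a list of (D_asr + D_tts) floats
             (a hypothetical projection, assumed for illustration)
    """
    if len(asr_bnf) != len(tts_bnf):
        raise ValueError("feature sequences must have the same number of frames")
    combined = []
    for asr_frame, tts_frame in zip(asr_bnf, tts_bnf):
        frame = asr_frame + tts_frame  # frame-wise concatenation
        if any(len(row) != len(frame) for row in weights):
            raise ValueError("projection width must equal D_asr + D_tts")
        # Linear projection of the concatenated frame
        combined.append([sum(w * x for w, x in zip(row, frame)) for row in weights])
    return combined


# Toy example: 2 frames, 2-dim ASR features, 1-dim TTS features
asr = [[1.0, 2.0], [3.0, 4.0]]
tts = [[0.5], [1.5]]
w = [[1.0, 0.0, 0.0],   # keeps the first ASR dimension
     [0.0, 1.0, 1.0]]   # sums the second ASR dimension and the TTS dimension
out = combine_bnfs(asr, tts, w)
print(out)  # [[1.0, 2.5], [3.0, 5.5]]
```

In a trained model the projection would be learned jointly with the other modules; the fixed weights here only make the arithmetic easy to follow.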

Publisher

World Scientific Pub Co Pte Ltd

Subject

General Earth and Planetary Sciences, General Engineering, General Environmental Science

Link

https://www.worldscientific.com/doi/pdf/10.1142/S271755452350011X
