Abstract
Non-parallel voice conversion (VC) has achieved considerable breakthroughs in recent years thanks to the use of self-supervised pre-trained representations (SSPR). Features extracted by a pre-trained model are expected to contain mostly content information. However, common SSPR-based VC includes no explicit mechanism for removing speaker information from the content representation extracted by the SSPR model, which limits how thoroughly the content representation can be purified. Moreover, conventional VC usually reconstructs the Mel-spectrogram as the acoustic feature, which is inconsistent with the input of the content encoder and results in information loss. Motivated by these issues, we propose W2VC to address them. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text context at the phoneme level, and a speaker classifier attached to the content encoder through a gradient reversal layer (GRL) removes speaker information from the extracted content representation; (3) a WLMR-based HiFi-GAN is trained to convert WLMR back to waveform speech. VC experimental results show that the GRL purifies the content information of the self-supervised model well, and that GRL purification and CTC supervision of the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the WLMR-retrained vocoder achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It achieves an objective MCD of 8.901 and subjective MOS scores of 4.45 for speech naturalness and 3.62 for speaker similarity, which is superior to the baseline.
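The adversarial purification step described in (2) centers on a gradient reversal layer: an identity map in the forward pass whose backward pass negates the gradient, so the content encoder is trained to defeat the speaker classifier. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the content dimension (768, the WavLM-Base hidden size), the speaker count (109, as in VCTK), and the classifier head are illustrative assumptions:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the content encoder; no gradient for lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class SpeakerAdversary(torch.nn.Module):
    """Hypothetical speaker classifier placed behind the GRL (dimensions are assumptions)."""
    def __init__(self, content_dim=768, num_speakers=109, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(content_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, num_speakers),
        )

    def forward(self, content):  # content: (batch, frames, content_dim)
        reversed_feat = grad_reverse(content, self.lambd)
        return self.classifier(reversed_feat.mean(dim=1))  # utterance-level logits

# Usage sketch: the speaker loss trains the classifier to identify the speaker,
# while the reversed gradient pushes the content encoder to hide speaker cues.
content = torch.randn(4, 200, 768, requires_grad=True)  # stand-in for content-encoder output
logits = SpeakerAdversary()(content)
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 109, (4,)))
loss.backward()  # gradients reaching `content` are reversed
```

The phoneme-level CTC supervision described in (2) can be applied to the same content representation (e.g., via `torch.nn.CTCLoss` on a phoneme prediction head), pulling the encoder toward linguistic content while the GRL pushes speaker identity out, which is consistent with the complementary effect reported in the abstract.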
Funder
Key Technologies Research and Development Program
Publisher
Springer Science and Business Media LLC
Subject
Electrical and Electronic Engineering, Acoustics and Ultrasonics
Cited by
1 article.