Affiliation:
1. National Engineering School of Carthage, Carthage University, Tunis 2035, Tunisia
2. ATISP—Advanced Technologies For Image and Signal Processing, ENET’COM, Sfax University, Sfax 3021, Tunisia
3. LORIA—Laboratoire Lorrain de Recherche en Informatique et ses Applications, B.P. 239, 54506 Vandœuvre-lès-Nancy, France
Abstract
We present an any-to-one voice conversion (VC) system that combines an autoregressive model with the LPCNet vocoder to improve the naturalness, intelligibility, and speaker similarity of converted speech. Non-parallel any-to-one voice conversion does not require paired source and target speech, so it can be applied to utterances from arbitrary source speakers. Recent advances in neural vocoders, such as WaveNet, have improved the efficiency of speech synthesis. In practice, however, we find that the trajectory of some generated waveforms is not consistently smooth, which leads to occasional voice errors. To address this issue, we propose an autoregressive (AR) conversion model used together with the high-fidelity LPCNet vocoder. This combination addresses the waveform smoothness problem, produces more natural and clearer speech, and is capable of real-time speech generation. To represent the linguistic content of an utterance precisely, we use speaker-independent phonetic posteriorgram (SI-PPG) features computed by an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPG features to the acoustic representations that LPCNet takes as input. The autoregressive structure lets the system predict the output of each step from the acoustic features predicted at the previous step. We evaluate the system on any-to-one conversion between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.
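The autoregressive step described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the linear mapping, the weight matrices `w_ppg` and `w_prev`, the all-zero initial frame, and the function name `ar_convert` are all assumptions chosen for illustration. It only shows how each predicted acoustic frame depends on the current SI-PPG frame and on the frame predicted at the previous step.

```python
def ar_convert(ppg_frames, w_ppg, w_prev, feat_dim):
    """Toy autoregressive mapping from PPG frames to acoustic frames.

    Hypothetical linear stand-in for a neural conversion model:
    frame_t = w_ppg @ ppg_t + w_prev @ frame_{t-1}
    """
    prev = [0.0] * feat_dim  # assumed all-zero start frame for the first step
    outputs = []
    for ppg in ppg_frames:
        frame = []
        for j in range(feat_dim):
            # Contribution of the current linguistic (PPG) frame.
            s = sum(w_ppg[j][i] * ppg[i] for i in range(len(ppg)))
            # Contribution of the previously predicted acoustic frame.
            s += sum(w_prev[j][k] * prev[k] for k in range(feat_dim))
            frame.append(s)
        outputs.append(frame)
        prev = frame  # feed the prediction back in at the next step
    return outputs


# Two PPG frames of dimension 2, one acoustic feature per frame.
demo = ar_convert(
    ppg_frames=[[1.0, 0.0], [0.0, 1.0]],
    w_ppg=[[1.0, 2.0]],
    w_prev=[[0.5]],
    feat_dim=1,
)
print(demo)  # [[1.0], [2.5]]
```

In the demo, the second frame is 2.0 from the PPG input plus 0.5 times the first predicted frame, which is the feedback that distinguishes an autoregressive model from a frame-independent one. In the described system, the predicted acoustic frames would then be passed to the vocoder to generate the waveform.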
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science