Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder

Published:2023-11-02 Issue:21 Volume:13 Page:11988
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Ezzine Kadria¹², Di Martino Joseph³, Frikha Mondher²

Affiliation:

1. National Engineering School of Carthage, Carthage University, Tunis 2035, Tunisia

2. ATISP—Advanced Technologies For Image and Signal Processing, ENET’COM, Sfax University, Sfax 3021, Tunisia

3. LORIA—Laboratoire Lorrain de Recherche en Informatique et ses Applications, B.P. 239, 54506 Vandœuvre-lès-Nancy, France

Abstract

We present an any-to-one voice conversion (VC) system that combines an autoregressive conversion model with the LPCNet vocoder to improve the naturalness, intelligibility, and speaker similarity of converted speech. As the name implies, non-parallel any-to-one voice conversion does not require paired source and target utterances and can be applied to arbitrary speech conversion tasks. Recent neural vocoders, such as WaveNet, have improved the efficiency of speech synthesis. In practice, however, we find that the trajectory of some generated waveforms is not consistently smooth, leading to occasional voice errors. To address this issue, we propose using an autoregressive (AR) conversion model together with the high-fidelity LPCNet vocoder. This combination not only addresses the waveform smoothness problem but also produces more natural and clearer speech, and it is capable of real-time speech generation. To represent the linguistic content of an utterance precisely, we use speaker-independent phonetic posteriorgram (SI-PPG) features computed by an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPG features to the acoustic representations that LPCNet takes as input. The autoregressive structure lets the system predict the output at each step from the acoustic features predicted at the previous step. We evaluate the system on any-to-one conversion between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.
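
The autoregressive step described in the abstract, where each acoustic frame is predicted from the current SI-PPG frame and the previously predicted acoustic frame, can be illustrated with a toy sketch. This is not the authors' model: the single tanh layer, the random weights, the all-zero starting frame, and the names `convert_autoregressive`, `PPG_DIM`, and `ACOUSTIC_DIM` are all assumptions made for illustration, and the frame size of 20 is only chosen to resemble an LPCNet-style feature vector.

```python
import math
import random

PPG_DIM = 8        # hypothetical SI-PPG size (real PPGs are much larger)
ACOUSTIC_DIM = 20  # assumed size of one LPCNet-style acoustic frame


def make_weights(rows, cols, rng):
    """Random matrix standing in for trained parameters (illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def convert_autoregressive(ppg_frames, w_ppg, w_prev):
    """Predict acoustic frames one at a time, feeding each prediction back in."""
    prev = [0.0] * ACOUSTIC_DIM  # assumed all-zero frame before the first step
    outputs = []
    for ppg in ppg_frames:
        from_ppg = matvec(w_ppg, ppg)     # contribution of linguistic content
        from_prev = matvec(w_prev, prev)  # contribution of the previous output
        frame = [math.tanh(a + b) for a, b in zip(from_ppg, from_prev)]
        outputs.append(frame)
        prev = frame  # autoregressive feedback into the next step
    return outputs


rng = random.Random(0)
w_ppg = make_weights(ACOUSTIC_DIM, PPG_DIM, rng)
w_prev = make_weights(ACOUSTIC_DIM, ACOUSTIC_DIM, rng)
ppg_frames = [[rng.random() for _ in range(PPG_DIM)] for _ in range(5)]
acoustic_frames = convert_autoregressive(ppg_frames, w_ppg, w_prev)
print(len(acoustic_frames), len(acoustic_frames[0]))
```

In a real system the predicted frames would be passed to the vocoder to generate the waveform; that stage is omitted here because its interface is not described in this record.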

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/21/11988/pdf
