Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training

Published:2024-07-03 Issue:7 Volume:17 Page:292
ISSN:1999-4893
Container-title:Algorithms
language:en

Author:

Ahmad Hawraz¹, Rashid Tarik²

Affiliation:

1. Department of Software and Informatics Engineering, Salahaddin University-Erbil, Erbil 44001, Iraq

2. Department of Computer Science and Engineering, University of Kurdistan Hawler, Erbil 44001, Iraq

Abstract

Recent advancements in text-to-speech (TTS) models have aimed to streamline the two-stage process into a single-stage training approach. However, many single-stage models still lag behind in audio quality, particularly when handling Kurdish text and speech. There is a critical need to enhance text-to-speech conversion for the Kurdish language, especially for the Sorani dialect, which is underrepresented in recent text-to-speech advancements. This study introduces an end-to-end TTS model for efficiently generating high-quality Kurdish audio. The proposed method leverages a variational autoencoder (VAE) that is pre-trained for audio waveform reconstruction and is augmented by adversarial training. This involves aligning the prior distribution established by the pre-trained encoder with the posterior distribution of the text encoder within latent variables. Additionally, a stochastic duration predictor is incorporated to imbue synthesized Kurdish speech with diverse rhythms. By aligning latent distributions and integrating the stochastic duration predictor, the proposed method facilitates the real-time generation of natural Kurdish speech audio, offering flexibility in pitches and rhythms. A subjective human evaluation using the mean opinion score (MOS) on a custom dataset confirms the superior performance of our approach (MOS of 3.94) compared with that of a one-stage system and other two-stage systems.
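
The abstract does not state the loss used to align the two latent distributions. As a minimal sketch, assuming both distributions are diagonal Gaussians (a common choice in VAE-based TTS, not confirmed by this record), the mismatch between them can be measured with a closed-form KL divergence. The function name `gaussian_kl` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL(q || p) for two diagonal Gaussians, summed over latent dimensions.

    Assumption: each distribution is given by per-dimension means and
    log standard deviations. This is an illustrative sketch, not the
    paper's implementation.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lq)  # variance of q in this dimension
        var_p = math.exp(2.0 * lp)  # variance of p in this dimension
        # Per-dimension closed form:
        # log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p)^2) / (2 var_p) - 1/2
        total += lp - lq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total
```

The value is zero when the two distributions coincide and grows as their means or variances diverge, so minimizing it during training pulls one distribution toward the other.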

Funder

Salahaddin University Erbil

Publisher

MDPI AG

Link

https://www.mdpi.com/1999-4893/17/7/292/pdf
