Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Published:2021-12 Issue:1 Volume:2021
ISSN:1687-4722
Container-title:EURASIP Journal on Audio, Speech, and Music Processing
language:en
Short-container-title:J AUDIO SPEECH MUSIC PROC.

Author:

Byambadorj Zolzaya, Nishimura Ryota, Ayush Altangerel, Ohta Kengo, Kitaoka Norihide

Abstract

Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset.
Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering, Acoustics and Ultrasonics

Link

https://link.springer.com/content/pdf/10.1186/s13636-021-00225-4.pdf

Cited by 10 articles.

1. Refining maritime Automatic Speech Recognition by leveraging synthetic speech;Maritime Transport Research;2024-12

2. Using Transfer Learning to Realize Low Resource Dungan Language Speech Synthesis;Applied Sciences;2024-07-20

3. Language technologies for a multilingual public administration in Spain;Revista de Llengua i Dret;2023-06-21

4. Exploring Solutions for Text-to-Speech Synthesis of Low-Resource Languages;2023 4th International Conference on Signal Processing and Communication (ICSPC);2023-03-23