Author:
Yao Yihao, Liang Tao, Feng Rui, Shi Keke, Yu Junxiao, Wang Wei, Li Jianqing
Abstract
Deep learning has significantly advanced text-to-speech (TTS) systems. These neural network-based systems have improved speech synthesis quality and are increasingly important in applications such as human-computer interaction. However, conventional TTS models still face challenges: the synthesized speech often lacks naturalness and expressiveness, and slow inference limits their efficiency in practice. This paper introduces SynthRhythm-TTS (SR-TTS), an optimized Transformer-based architecture designed to enhance synthesized speech. SR-TTS not only improves phonological quality and naturalness but also accelerates the speech generation process, thereby increasing inference efficiency. SR-TTS consists of an encoder, a rhythm coordinator, and a decoder. In particular, a pre-duration predictor within the rhythm coordinator and a self-attention-based feature predictor work together to enhance the naturalness and articulatory accuracy of speech. In addition, the introduction of causal convolution enhances the consistency of the time series. The cross-linguistic capability of SR-TTS is validated by training it on both English and Chinese corpora. Human evaluation shows that SR-TTS outperforms existing techniques in terms of speech quality and naturalness of expression. This technology is particularly suitable for applications that require high-quality natural speech, such as intelligent assistants, speech-synthesized podcasts, and human-computer interaction.
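The abstract notes that causal convolution improves temporal consistency. As a minimal sketch of the general idea (not the authors' implementation, which is not detailed here), a causal 1-D convolution left-pads the sequence so that each output frame depends only on the current and past input frames, never future ones:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-1], ...
    Causality is enforced by left-padding the sequence with (k-1) zeros."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # output[t] uses padded[t : t+k], i.e. the inputs x[t-k+1 .. t]
    return np.array([np.dot(kernel, padded[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])   # a toy frame-level feature sequence
w = np.array([0.5, 0.5])             # averages the previous and current frame
y = causal_conv1d(x, w)
print(y)  # [0.5 1.5 2.5 3.5]
```

Because no output frame looks ahead in time, stacking such layers preserves the autoregressive ordering of the acoustic feature sequence, which is what makes the time series consistent during generation.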
Cited by
1 article.