FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Published: 2022-07
Container-title: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Author:

Huang Rongjie¹, Lam Max W. Y.², Wang Jun², Su Dan², Yu Dong³, Ren Yi¹, Zhao Zhou¹

Affiliation:

1. Zhejiang University

2. Tencent AI Lab, China

3. Tencent AI Lab, USA

Abstract

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. FastDiff also achieves a sampling speed 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to the mel-spectrogram inversion of unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.
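
As a rough illustration of what a figure like "58x faster than real-time" measures, the sketch below divides the duration of the generated audio by the wall-clock time spent generating it. The function name `speedup_over_real_time`, the 22,050 Hz sample rate, and the timing values are hypothetical choices for this example; none of them are taken from the paper, and this is not the authors' benchmarking code.

```python
def speedup_over_real_time(num_samples: int, sample_rate: int,
                           synthesis_seconds: float) -> float:
    """Return audio duration divided by wall-clock synthesis time.

    A value above 1 means audio is generated faster than it plays back.
    """
    if sample_rate <= 0 or synthesis_seconds <= 0:
        raise ValueError("sample_rate and synthesis_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / synthesis_seconds


# Hypothetical numbers: 10 s of audio at an assumed 22,050 Hz sample rate,
# synthesized in 10/58 s of wall-clock time.
speedup = speedup_over_real_time(
    num_samples=220_500,
    sample_rate=22_050,
    synthesis_seconds=10 / 58,
)
print(f"{speedup:.1f}x faster than real-time")
```

The inverse of this ratio is often reported as the real-time factor (RTF), so a 58x speedup corresponds to an RTF of about 0.017.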

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 27 articles.

1. Intelligent Model Update Strategy for Sequential Recommendation;Proceedings of the ACM Web Conference 2024;2024-05-13

2. Controllable Data Generation by Deep Learning: A Review;ACM Computing Surveys;2024-04-25

3. Latent diffusion transformer for point cloud generation;The Visual Computer;2024-04-22

4. DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14

5. FreGrad: Lightweight and Fast Frequency-Aware Diffusion Vocoder;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14