Affiliation:
1. Google, France. kharitonov@google.com
2. Google, Switzerland. damienv@google.com
3. Google, Switzerland
4. Google, France
5. Google, France. neilz@google.com
Abstract
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to “reading”) and from semantic tokens to low-level acoustic tokens (“speaking”). Decoupling these two tasks enables training of the “speaking” module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the “reading” component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in naturalness and acoustic quality.
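The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. All function names and token logic below are hypothetical placeholders, standing in for the Transformer-based sequence-to-sequence models and the neural codec used in the actual system:

```python
# Hypothetical sketch of the SPEAR-TTS two-stage pipeline; names and token
# computations are illustrative placeholders, not the paper's implementation.
from typing import List

def read_stage(text: str) -> List[int]:
    """'Reading': map text to high-level semantic tokens.
    In the paper this is a seq2seq model trained with pretraining
    and backtranslation to reduce the need for parallel data."""
    # Placeholder tokenization; a real system runs a learned model.
    return [hash(ch) % 512 for ch in text]

def speak_stage(semantic: List[int], prompt: List[int]) -> List[int]:
    """'Speaking': map semantic tokens to low-level acoustic tokens,
    conditioned on a short acoustic prompt (~3 s) that fixes the
    speaker identity via example prompting."""
    # The prompt is prepended so generation continues in the same voice.
    return prompt + [t % 1024 for t in semantic]

def tts(text: str, speaker_prompt: List[int]) -> List[int]:
    semantic = read_stage(text)
    acoustic = speak_stage(semantic, speaker_prompt)
    # In practice, acoustic tokens are decoded to a waveform by a
    # neural codec decoder; audio-only data suffices to train this stage.
    return acoustic
```

The key design point the sketch mirrors is the decoupling: only `read_stage` needs text-audio pairs, while `speak_stage` can be trained on audio alone.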
Subject
Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication
Cited by: 20 articles.