1. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
2. Common Voice: A massively-multilingual speech corpus. Ardila, 2019.
3. A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing. Bai. Proc. ICML, 2022.
4. Maestro-U: Leveraging joint speech–text representation learning for zero supervised speech ASR. Chen, 2022.
5. Improving Speech Recognition Using Consistent Predictions on Synthesized Speech