1. Arık, S.Ö., et al.: Deep Voice: real-time neural text-to-speech. In: International Conference on Machine Learning, pp. 195–204. PMLR (2017)
2. Casanova, E., Weber, J., Shulby, C.D., Junior, A.C., Gölge, E., Ponti, M.A.: YourTTS: towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In: International Conference on Machine Learning, pp. 2709–2720. PMLR (2022)
3. Chen, L.W., Watanabe, S., Rudnicky, A.I.: A vector quantized approach for text to speech synthesis on real-world spontaneous speech. arXiv preprint arXiv:2302.04215 (2023). https://api.semanticscholar.org/CorpusID:256662411
4. Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: WaveGrad: estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713 (2020). http://arxiv.org/abs/2009.00713
5. D’Alessandro, N., Sebbe, R., Bozkurt, B., Dutoit, T.: MaxMBROLA: a Max/MSP MBROLA-based tool for real-time voice synthesis. In: 2005 13th European Signal Processing Conference, pp. 1–4. IEEE (2005)