1. Beerends, J., et al.: Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I: temporal alignment. AES: J. Audio Eng. Soc. 61, 366–384 (2013)
2. Casanova, E., Weber, J., Shulby, C., Junior, A.C., Gölge, E., Ponti, M.A.: YourTTS: towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone (2021)
3. Cho, H., Jung, W., Lee, J., Woo, S.H.: SANE-TTS: stable and natural end-to-end multilingual text-to-speech. In: Ko, H., Hansen, J.H.L. (eds.) 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, 18–22 September 2022, pp. 1–5. ISCA (2022). https://doi.org/10.21437/Interspeech.2022-46
4. Delalez, S., Akue, L.: Neural TTS in French: comparing graphemic and phonetic inputs using the SynPaFlex-Corpus and Tacotron2 (2023)
5. Elias, I., et al.: Parallel Tacotron 2: a non-autoregressive neural TTS model with differentiable duration modeling. In: Proceedings of Interspeech 2021, pp. 141–145 (2021). https://doi.org/10.21437/Interspeech.2021-1461