1. Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.-Y. Liu, FastSpeech: Fast, Robust and Controllable Text to Speech, in: NeurIPS, 2019, pp. 3165–3174.
2. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.-Y. Liu, FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, in: ICLR, 2021.
3. Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, et al., Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis, in: ICML, 2018, pp. 5167–5176.
4. Y. Ren, J. Liu, Z. Zhao, PortaSpeech: Portable and High-Quality Generative Text-to-Speech, in: NeurIPS, 2021, pp. 13963–13974.
5. El-Sallam, Correlation based speech-video synchronization, Pattern Recognit. Lett., 2011.