1. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," Proc. SSW9, p. 125 (2016).
2. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," Proc. ICASSP 2018, pp. 4779–4783 (2018).
3. A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda and T. Toda, "Speaker-dependent WaveNet vocoder," Proc. Interspeech 2017, pp. 1118–1122 (2017).
4. J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet and V. Aggarwal, "Towards achieving robust universal neural vocoding," Proc. Interspeech 2019, pp. 181–185 (2019).
5. J. Kong, J. Kim and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Proc. NeurIPS 2020, pp. 17022–17033 (2020).