1. Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerry-Ryan, R.J. (2018, April 15–20). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.
2. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv.
3. Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A.v.d., Dieleman, S., and Kavukcuoglu, K. (2018, July 10–15). Efficient neural audio synthesis. Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden.
4. Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.Y. (2019). FastSpeech: Fast, robust and controllable text to speech. arXiv.
5. Peng, K., Ping, W., Song, Z., and Zhao, K. (2020, July 13–18). Non-autoregressive neural text-to-speech. Proceedings of the International Conference on Machine Learning, PMLR, Virtual.