1. Zen H, Nose T, Yamagishi J, Sako S, Masuko T, Black AW, Tokuda K (2007) The HMM-based speech synthesis system (HTS) version 2.0. SSW 6:294–299
2. Van den Oord A, Kalchbrenner N, Espeholt L, Vinyals O, Graves A (2016) Conditional image generation with pixelcnn decoders. Adv Neural Inf Process Syst 29:1–9. ArXiv, abs/1606.05328
3. Van Den Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International Conference on Machine Learning, MLR, pp. 1747–1756
4. Wang Y, Skerry-Ryan RJ, Stanton D, Wu Y, Weiss RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, Le Q, Agiomyrgiannakis Y, Clark Y, Saurous RA, Saurous RA (2017) Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135
5. Griffin D, Lim J (1984) Signal estimation from modified short-time Fourier transform. IEEE Trans Acoust Speech Signal Process 32(2):236–243