1) G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath and B. Kingsbury, ``Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,'' IEEE Signal Process. Mag., 29, 82-97 (2012).
2) A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior and K. Kavukcuoglu, ``WaveNet: A generative model for raw audio,'' arXiv preprint arXiv:1609.03499 (2016).
3) S. Takamichi, T. Koriyama and H. Saruwatari, ``Sampling-based speech parameter generation using moment-matching networks,'' Proc. Interspeech, Stockholm, Sweden, Aug., pp. 3961-3965 (2017).
4) Y. Saito, S. Takamichi and H. Saruwatari, ``Statistical parametric speech synthesis incorporating generative adversarial networks,'' IEEE/ACM Trans. Audio Speech Lang. Process., 26, 84-96 (2018).
5) M. Abe, Y. Sagisaka, T. Umeda and H. Kuwabara, ``Speech database user manual,'' ATR Tech. Rep., No. TR-I-0166M (1990).