Authors:
Jia Ning, Zheng Chunjun, Sun Wei
Abstract
The generation of emotional speech is a challenging research topic with wide applications in speech processing. Because the design of effective speech feature representations and generation models directly affects the accuracy of emotional speech generation, a general solution for emotional speech synthesis is difficult to find. This paper takes the CycleGAN model as its starting point and uses an improved convolutional neural network (CNN) model together with an identity-mapping loss to capture temporal information effectively. At the same time, the forward and inverse mappings are learned jointly to find the best-matching design, preserving the linguistic content of the speech in the process without relying on additional audio data. Experiments on a corpus of children's read speech show that, by comparing speech emotion before and after the improvement, emotional speech can be recognized accurately. Comparison with common emotional speech generation models verifies the advantages of the proposed model.
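The abstract's combination of cycle consistency and an identity-mapping loss can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: `G` and `F` stand in for the paper's improved CNN generators as toy linear maps, and the loss weights `lam_cyc` and `lam_id` are assumed values, not taken from the paper.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute error, the L1 penalty typically used for these terms."""
    return np.mean(np.abs(a - b))

# Toy stand-in generators: G maps neutral features to emotional features,
# F maps back. In the paper these would be CNN models.
def G(x):
    return 1.5 * x + 0.2

def F(y):
    return (y - 0.2) / 1.5

def cyclegan_aux_losses(x, y, lam_cyc=10.0, lam_id=5.0):
    """Auxiliary losses trained alongside the adversarial losses."""
    # Cycle consistency in both directions: F(G(x)) should recover x,
    # and G(F(y)) should recover y, so linguistic content is preserved.
    cyc = l1_loss(F(G(x)), x) + l1_loss(G(F(y)), y)
    # Identity mapping: feeding a generator an input already in its
    # target domain should change it little.
    ident = l1_loss(G(y), y) + l1_loss(F(x), x)
    return lam_cyc * cyc, lam_id * ident

x = np.linspace(-1.0, 1.0, 8)   # stand-in neutral acoustic features
y = G(x)                        # stand-in emotional acoustic features
cyc, ident = cyclegan_aux_losses(x, y)
print(cyc, ident)
```

Because the toy `F` exactly inverts `G`, the cycle-consistency term is near zero here, while the identity term is nonzero and penalizes the generators for altering inputs already in their target domain.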
Subject
General Physics and Astronomy
Cited by: 4 articles.