Abstract
Learning human emotions from speech with precision is an essential capability of modern speech emotion recognition (SER) models. Deriving and interpreting the various crucial features in raw speech data is therefore a complex modeling task when the goal is improved performance. In this study, we developed a novel SER model based on attention-oriented parallel convolutional neural network (CNN) encoders that acquire, in parallel, the important features used for emotion classification. Specifically, MFCC, paralinguistic, and speech spectrogram features were derived and encoded by CNN architectures designed individually for each feature type; the encoded features were then fed to attention mechanisms for further representation and finally classified. Empirical evaluation was carried out on the open EMO-DB and IEMOCAP datasets, and the results showed that the proposed model is more efficient than the baseline models. In particular, the weighted accuracy (WA) and unweighted accuracy (UA) of the proposed model were 71.8% and 70.9%, respectively, on the EMO-DB dataset, and 72.4% and 71.1% on the IEMOCAP dataset.
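The parallel-encoder design described above can be illustrated with a minimal sketch: three feature streams (standing in for MFCC, paralinguistic, and spectrogram encodings) are each summarized by attention pooling over time and fused by concatenation before a classifier head. This is not the authors' implementation; all shapes, the dot-product attention form, and the linear head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(features, w):
    # features: (T, d) frames from one CNN encoder; w: (d,) attention vector.
    scores = features @ w                      # (T,) per-frame relevance
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                     # softmax over time
    return alphas @ features                   # (d,) attended summary

# Hypothetical outputs of three parallel encoders (MFCC, paralinguistic,
# spectrogram streams): 50 time frames x 64 feature dims each.
streams = [rng.standard_normal((50, 64)) for _ in range(3)]
att_vectors = [rng.standard_normal(64) for _ in range(3)]

# Attend over each stream independently, then fuse by concatenation.
fused = np.concatenate([attention_pool(s, w)
                        for s, w in zip(streams, att_vectors)])

# Linear classifier head over the fused representation (4 emotion classes).
W, b = rng.standard_normal((4, fused.size)), np.zeros(4)
logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # class probabilities
```

Each encoder keeps its own attention parameters, so a stream that is more informative for a given utterance can dominate its own pooled summary without affecting the others.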
Funder
MSIT (Ministry of Science and ICT), Korea
Gachon University
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by 16 articles.