Speech Emotion Recognition Based on Temporal-Spatial Learnable Graph Convolutional Neural Network

Published:2024-05-21 Issue:11 Volume:13 Page:2010
ISSN:2079-9292
Container-title:Electronics
language:en

Author:

Yan Jingjie¹, Li Haihua¹, Xu Fengfeng¹, Zhou Xiaoyang²³, Liu Ying⁴, Yang Yuan⁵

Affiliation:

1. Jiangsu Key Laboratory of Intelligent Information Processing and Communication Technology, College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

2. School of Information Science and Engineering, Southeast University, Nanjing 210096, China

3. China Mobile Zijin (Jiangsu) Innovation Research Institute Co., Ltd., Nanjing 211189, China

4. China Mobile Communications Group Jiangsu Co., Ltd., Nanjing Branch, Nanjing 211135, China

5. School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China

Abstract

Graph convolutional neural networks (GCNs) have shown excellent performance in deep learning, and representing speech data as graphs is a computationally efficient and scalable approach. To improve how well graph neural networks extract emotional features from speech, this paper proposes a Temporal-Spatial Learnable Graph Convolutional Neural Network (TLGCNN) for speech emotion recognition. TLGCNN first uses the openSMILE toolkit to extract frame-level speech emotion features. A bidirectional long short-term memory (Bi-LSTM) network then models the long-term dependencies among these features to extract deeper frame-level emotion features. The extracted features are fed into the subsequent network through two pathways. One pathway builds a graph from the deep frame-level emotion feature vectors, applying an adaptive adjacency matrix to capture latent spatial connections; the other concatenates the emotion feature vectors with the graph-level embedding obtained from the learnable graph convolutional neural network for prediction and classification. Through these two pathways, TLGCNN simultaneously obtains temporal emotional information through the Bi-LSTM and spatial emotional information through the Learnable Graph Convolutional Neural (LGCN) network. Experimental results show that the method achieves weighted accuracies of 66.82% and 58.35% on the IEMOCAP and MSP-IMPROV databases, respectively.
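
The two-pathway idea in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the abstract does not give the layer definitions, so the row-softmax adjacency, the single ReLU graph convolution, the mean pooling, and all names (`tlgcnn_head`, `A_logits`, `W_gcn`, `W_out`) are assumptions chosen for illustration. A random matrix stands in for the Bi-LSTM outputs, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_conv(H, A, W):
    """One graph convolution layer (assumed form): ReLU(A @ H @ W)."""
    return np.maximum(A @ H @ W, 0.0)

def tlgcnn_head(H, A_logits, W_gcn, W_out):
    """Fuse a temporal summary with a graph-level embedding.

    H        : (T, d) frame-level features, standing in for Bi-LSTM outputs
    A_logits : (T, T) free parameters of the adjacency matrix (assumption)
    W_gcn    : (d, h) graph convolution weights
    W_out    : (d + h, n_classes) classifier weights
    """
    # Spatial pathway: each frame is a node; the adjacency is a learnable
    # parameter, row-normalized with softmax (an assumed parameterization).
    A = softmax(A_logits, axis=-1)
    Z = graph_conv(H, A, W_gcn)          # (T, h) node embeddings
    g = Z.mean(axis=0)                   # (h,) graph-level embedding

    # Temporal pathway: mean over frames as a simple summary (assumption).
    t = H.mean(axis=0)                   # (d,)

    # Concatenate both views and classify.
    fused = np.concatenate([t, g])       # (d + h,)
    return softmax(fused @ W_out), A

# Toy dimensions: 20 frames, 16-dim features, 8-dim graph embedding, 4 emotions.
rng = np.random.default_rng(0)
T, d, h, n_classes = 20, 16, 8, 4
H = rng.standard_normal((T, d))
A_logits = rng.standard_normal((T, T))
W_gcn = rng.standard_normal((d, h)) * 0.1
W_out = rng.standard_normal((d + h, n_classes)) * 0.1

probs, A = tlgcnn_head(H, A_logits, W_gcn, W_out)
```

In a trained model, `A_logits`, `W_gcn`, and `W_out` would be optimized jointly with the Bi-LSTM, which is what would make the adjacency matrix "learnable"; here they only show how the shapes fit together.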

Funder

National Natural Science Foundation of China

Open Project of Blockchain Technology and Data Security Key Laboratory, Ministry of Industry and Information Technology

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/11/2010/pdf

References: 42 articles.

1. Kosti, R., Alvarez, J.M., Recasens, A., and Lapedriza, A. (2017, January 21–26). Emotion Recognition in Context. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.

2. Kamel (2011). Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit.

3. Lakomkin, E., Zamani, M.A., Weber, C., Magg, S., and Wermter, S. (2018, January 1–5). On the Robustness of Speech Emotion Recognition for Human-Robot Interaction with Deep Neural Networks. Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain.

4. Li, H.-C., Pan, T., Lee, M.-H., and Chiu, H.-W. (2021). Make Patient Consultation Warmer: A Clinical Application for Speech Emotion Recognition. Appl. Sci., 11.

5. Appuhamy, E.J.G.S., Madhusanka, B.G.D.A., and Herath, H.M.K.K.M.B. (2023). Computational Methods in Psychiatry, Springer.