A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Published:2023-03-24 Issue:7 Volume:13 Page:4124
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Tu Zhongwen¹, Liu Bin², Zhao Wei³, Yan Raoxin², Zou Yang²

Affiliation:

1. Educational Service Center, Communication University of China, Beijing 100024, China

2. School of Information and Engineering, Communication University of China, Beijing 100024, China

3. School of Data and Intelligence, Communication University of China, Beijing 100024, China

Abstract

Speech Emotion Recognition (SER), which aims to identify the emotion expressed in speech, has long been an important topic in speech acoustics. In recent years, deep-learning methods have brought great progress in SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representations still limit research. This paper proposes a novel SER method that combines data augmentation, feature selection and feature fusion. First, to address the shortage of samples in speech emotion datasets and the imbalance in the number of samples per category, a speech data augmentation method, Mix-wav, is proposed; it is applied to audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to further extract the spectrum vector from the Log-Mel spectrum. On the other hand, Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction across four global emotion feature sets, and the more effective statistical emotion features are extracted for fusion with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. The results show that the global feature set, after feature selection, can supplement the features extracted by a single deep-learning model through feature fusion and achieve better classification accuracy.
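
The abstract reports unweighted average test accuracy and notes that the emotion categories are unbalanced. The sketch below assumes the common SER convention in which unweighted accuracy is the mean of per-class recall, so every emotion counts equally regardless of how many samples it has. The abstract does not spell out the formula, so this is an illustration rather than the authors' evaluation code; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def unweighted_average_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of unweighted accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly (dominated by large classes)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: an unbalanced set where the model always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

ua = unweighted_average_accuracy(y_true, y_pred)  # (1.0 + 0.0) / 2 = 0.5
oa = overall_accuracy(y_true, y_pred)             # 8 / 10 = 0.8
print(f"unweighted: {ua:.2f}, overall: {oa:.2f}")
```

On this toy data the overall accuracy looks high because the majority class dominates, while the unweighted figure exposes that one class is never recognized. This is why a class-balanced metric is a natural choice for unbalanced emotion datasets.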

Funder

Fundamental Research Funds for the Central Universities

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/7/4124/pdf

References: 51 articles.

1. Speech Emotion Recognition: Emotional Models, Databases, Features, Preprocessing Methods, Supporting Modalities, and Classifiers;Speech Commun.,2020

2. Speech Emotion Recognition Using Deep Learning Techniques: A Review;Khalil;IEEE Access,2019

3. Lv, Z., Poiesi, F., Dong, Q., Lloret, J., and Song, H. (2022). Deep Learning for Intelligent Human–Computer Interaction. Appl. Sci., 12.

4. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., and Weiss, B. (2005, January 4–8). A Database of German Emotional Speech. Proceedings of the Interspeech 2005, Lisbon, Portugal.

5. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database;Busso;Lang Resour. Eval.,2008

Cited by 7 articles.

1. Multi-Label Emotion Recognition of Korean Speech Data Using Deep Fusion Models;Applied Sciences;2024-08-28

2. Enhancing speech emotion recognition through deep learning and handcrafted feature fusion;Applied Acoustics;2024-06

3. MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition;IEEE Access;2024

4. Data Augmentation Impact on Deep learning Performance for Stress Detection;2023 Eleventh International Conference on Intelligent Computing and Information Systems (ICICIS);2023-11-21

5. Feature fusion strategy and improved GhostNet for accurate recognition of fish feeding behavior;Computers and Electronics in Agriculture;2023-11