Advanced Fusion-Based Speech Emotion Recognition System Using a Dual-Attention Mechanism with Conv-Caps and Bi-GRU Features

Published: 2022-04-22; Volume: 11; Issue: 9; Page: 1328
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics

Author:

Maji Bubai, Swain Monorama, Mustaqeem Mustaqeem

Abstract

Recognizing a speaker’s emotional state from speech signals plays a crucial role in human–computer interaction (HCI). Numerous linguistic resources are now available, but most of them contain samples of a discrete length. In this article, we address a leading challenge in speech emotion recognition (SER): how to extract the essential emotional features from utterances of variable length. To obtain better emotional information from the speech signals and increase the diversity of that information, we present an advanced fusion-based dual-channel self-attention mechanism using convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) networks. We extracted six spectral features: Mel-spectrograms, Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square. The Conv-Cap module was applied to the Mel-spectrograms, while the Bi-GRU was applied to the rest of the spectral features from the input tensor. A self-attention layer was employed in each module to selectively focus on optimal cues and determine the attention weights that yield high-level features. Finally, we used a confidence-based fusion method to fuse all high-level features and passed them through fully connected layers to classify the emotional states. The proposed model was evaluated on the Berlin (EMO-DB), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and Odia (SITB-OSED) datasets. In the experiments, it achieved weighted accuracy (WA) and unweighted accuracy (UA) values of 90.31% and 87.61% on EMO-DB, 76.84% and 70.34% on IEMOCAP, and 87.52% and 86.19% on SITB-OSED, demonstrating that the proposed model outperformed state-of-the-art models on the same datasets.
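
The abstract reports both WA and UA without defining them. The sketch below shows how the two metrics are commonly computed in SER work, assuming the usual convention that WA is overall accuracy and UA is the mean of per-class recalls; the abstract does not confirm that the paper uses exactly these definitions. The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples.

    Assumed definition of WA; large classes dominate this score.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally.

    Assumed definition of UA; it is less sensitive to class imbalance.
    """
    total = defaultdict(int)  # number of samples per true class
    hits = defaultdict(int)   # number of correct predictions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels for an imbalanced three-class problem.
y_true = ["angry"] * 6 + ["sad"] * 2 + ["neutral"] * 2
y_pred = ["angry"] * 6 + ["sad", "angry"] + ["neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")  # WA = 0.8000, UA = 0.6667
```

In this toy case the majority class is classified perfectly, so WA (0.80) exceeds UA (about 0.67). The reported results show the same direction of gap on all three datasets, which would be consistent with class imbalance, although the abstract does not state the cause.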

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/11/9/1328/pdf

References: 74 articles.

1. A generalized zero-shot framework for emotion recognition from body gestures; Wu; arXiv, 2020

2. Facial Emotion Recognition Using Hybrid Features

3. Knowledge-based Framework for Intelligent Emotion Recognition in Spontaneous Speech

Cited by 30 articles.

1. Squeeze-and-excitation 3D convolutional attention recurrent network for end-to-end speech emotion recognition; Applied Soft Computing; 2024-08

2. Research into the Influence of Informative Speech Parameters on the Reliability of RNN-based Classification of Human Emotional States; 2024 26th International Conference on Digital Signal Processing and its Applications (DSPA); 2024-03-27

3. Continuous feature learning representation to XGBoost classifier on the aggregation of discriminative Features using DenseNet-121 architecture and ResNet 18 architectures towards Apraxia Recognition in the Child Speech Therapy; International Journal of Speech Technology; 2024-03

4. A Survey on Multi-modal Emotion Detection Techniques; 2024-02-13

5. SER-Fuse: An Emotion Recognition Application Utilizing Multi-Modal, Multi-Lingual, and Multi-Feature Fusion; Proceedings of the 12th International Symposium on Information and Communication Technology; 2023-12-07