Affiliation:
1. Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea
2. Department of Intelligent IoT, Sangmyung University, Seoul 03016, Republic of Korea
Abstract
The significance of emotion recognition technology continues to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new architecture that combines a bidirectional long short-term memory (BiLSTM)–Transformer with a 2D convolutional neural network (CNN). The BiLSTM–Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-Spectrograms to capture the spatial details of the audio. Model performance is validated with 10-fold cross-validation. The proposed methodology was applied to Emo-DB and RAVDESS, two major databases for emotion recognition from speech, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the proposed Transformer-based deep learning model, combined with appropriate feature selection, can enhance performance in emotion recognition from speech.
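The abstract describes a two-branch architecture: a BiLSTM–Transformer over sequential audio features and a 2D CNN over Mel-Spectrograms, fused for classification. A minimal PyTorch sketch of such a design is shown below; all layer sizes, pooling choices, and the late-fusion-by-concatenation strategy are illustrative assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class BiLSTMTransformerCNN(nn.Module):
    """Hypothetical sketch of the fused architecture outlined in the abstract:
    a BiLSTM-Transformer branch for sequential audio features and a 2D CNN
    branch for Mel-Spectrograms, concatenated before a linear classifier."""

    def __init__(self, n_features=40, n_mels=64, n_classes=7, hidden=64):
        super().__init__()
        # Sequential branch: BiLSTM followed by a Transformer encoder.
        self.bilstm = nn.LSTM(n_features, hidden,
                              batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Spatial branch: small 2D CNN over the Mel-Spectrogram "image".
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (batch, 32, 1, 1)
        )
        self.classifier = nn.Linear(2 * hidden + 32, n_classes)

    def forward(self, feats, mel):
        # feats: (batch, time, n_features); mel: (batch, 1, n_mels, time)
        seq, _ = self.bilstm(feats)
        seq = self.transformer(seq).mean(dim=1)  # temporal average pooling
        spat = self.cnn(mel).flatten(1)
        return self.classifier(torch.cat([seq, spat], dim=1))

model = BiLSTMTransformerCNN()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 1, 64, 100))
print(tuple(logits.shape))
```

The seven-class output here merely reflects the seven emotion categories of Emo-DB; RAVDESS uses eight, so `n_classes` would change per corpus.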
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
5 articles.