Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS

Published:2023-02-03 Issue:3 Volume:23 Page:1743
ISSN:1424-8220
Container-title:Sensors
language:en

Author:

Toyoshima Itsuki¹, Okada Yoshifumi², Ishimaru Momoko¹, Uchiyama Ryunosuke¹, Tada Mayu¹

Affiliation:

1. Division of Information and Electronic Engineering, Muroran Institute of Technology, 27-1, Mizumoto-cho, Muroran 050-8585, Hokkaido, Japan

2. College of Information and Systems, Muroran Institute of Technology, 27-1, Mizumoto-cho, Muroran 050-8585, Hokkaido, Japan

Abstract

The existing research on emotion recognition commonly uses mel spectrogram (MelSpec) and Geneva minimalistic acoustic parameter set (GeMAPS) as acoustic parameters to learn the audio features. MelSpec can represent the time-series variations of each frequency but cannot manage multiple types of audio features. On the other hand, GeMAPS can handle multiple audio features but fails to provide information on their time-series variations. Thus, this study proposes a speech emotion recognition model based on a multi-input deep neural network that simultaneously learns these two audio features. The proposed model comprises three parts, specifically, for learning MelSpec in image format, learning GeMAPS in vector format, and integrating them to predict the emotion. Additionally, a focal loss function is introduced to address the imbalanced data problem among the emotion classes. The results of the recognition experiments demonstrate weighted and unweighted accuracies of 0.6657 and 0.6149, respectively, which are higher than or comparable to those of the existing state-of-the-art methods. Overall, the proposed model significantly improves the recognition accuracy of the emotion “happiness”, which has been difficult to identify in previous studies owing to limited data. Therefore, the proposed model can effectively recognize emotions from speech and can be applied for practical purposes with future development.
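
The focal loss and the two accuracy figures named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the focusing parameter `gamma=2.0` is a placeholder value, no per-class weighting term is included, and "weighted" and "unweighted" accuracy follow the convention common in speech emotion recognition (overall accuracy and the mean of per-class recalls, respectively). The function names are hypothetical and this is not the authors' implementation.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma is an assumed value; gamma = 0 reduces to plain cross-entropy.
    Well-classified samples (p_t near 1) are down-weighted, so hard and
    under-represented classes contribute relatively more to the loss.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def weighted_accuracy(y_true, y_pred):
    """Assumed definition: fraction of all samples classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Assumed definition: mean of per-class recalls (each class counts equally)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)
```

Under these definitions, a classifier that always predicts the majority class can score a high weighted accuracy while its unweighted accuracy stays low, which is why imbalanced emotion datasets are usually reported with both figures.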

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/23/3/1743/pdf

Cited by 4 articles.

1. Deep learned features selection algorithm: Removal operation of anomaly feature maps (RO-AFM);Applied Soft Computing;2024-09

2. Exploring the Effectiveness of the Phase Features on Double Compressed AMR Speech Detection;Applied Sciences;2024-05-26

3. Penetration State Identification of Aluminum Alloy Cold Metal Transfer Based on Arc Sound Signals Using Multi-Spectrogram Fusion Inception Convolutional Neural Network;Electronics;2023-12-06

4. Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders;Sensors;2023-07-24