Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation

Published: 2022-11-28 Issue: 23 Volume: 11 Page: 3935
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics

Author:

Bautista John Lorenzo, Lee Yun Kyung, Shin Hyun Soon

Abstract

This paper addresses automatic speech emotion recognition (SER), classifying eight emotions with parallel networks trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A CNN-based network and an attention-based network, running in parallel, model spatial and temporal feature representations, respectively. The training data were augmented using Additive White Gaussian Noise (AWGN), SpecAugment, Room Impulse Response (RIR), and Tanh Distortion to help the model generalize. Raw audio was transformed into Mel-spectrograms as the model’s input. To exploit the proven capability of CNNs in image classification and spatial feature representation, each spectrogram was treated as an image whose height and width correspond to the spectrogram’s time and frequency scales. Temporal feature representations were modeled by attention-based modules: a Transformer and a BLSTM-Attention module. The proposed architectures, in which a CNN-based network runs in parallel with a Transformer or BLSTM-Attention module, were compared with standalone CNN architectures and attention-based networks, as well as with hybrid architectures in which CNN layers wrapped in time-distributed wrappers are stacked on attention-based networks. In these experiments, the highest accuracies, 89.33% for the Parallel CNN-Transformer network and 85.67% for the Parallel CNN-BLSTM-Attention network, were achieved on a 10% hold-out test set from the dataset. These networks showed promising accuracy while using significantly fewer trainable parameters than the non-parallel hybrid models.
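
As a rough illustration of three of the augmentations named in the abstract (AWGN, Tanh Distortion, and SpecAugment-style masking), the following plain-Python sketch may help. It is not the authors' implementation: the function names, the SNR and gain parameters, the peak-preserving rescale, and the choice to mask with zeros are all assumptions made here for illustration, and the abstract does not state the settings actually used. In practice an array library would replace the list comprehensions.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB.

    `snr_db` is a hypothetical parameter; the abstract gives no noise levels.
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def tanh_distortion(signal, gain=5.0):
    """Soft-clip a waveform with tanh, then rescale to the original peak.

    `gain` controls how hard the waveform is driven into saturation
    (an assumed knob, not a value taken from the paper).
    """
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return list(signal)
    # Normalize to [-1, 1], saturate, and undo the normalization so that
    # the loudest sample keeps its original magnitude.
    scale = peak / math.tanh(gain)
    return [scale * math.tanh(gain * x / peak) for x in signal]


def mask_spectrogram(spec, freq_start, freq_width, time_start, time_width):
    """Zero out one frequency band and one time band of a spectrogram.

    `spec` is a list of rows (frequency bins), each a list of frames.
    Masking with 0.0 is an assumption; the input is left unmodified.
    """
    masked = [row[:] for row in spec]
    for f in range(freq_start, min(freq_start + freq_width, len(masked))):
        masked[f] = [0.0] * len(masked[f])
    for row in masked:
        for t in range(time_start, min(time_start + time_width, len(row))):
            row[t] = 0.0
    return masked
```

A typical use would apply `add_awgn` and `tanh_distortion` to the raw waveform before computing the Mel-spectrogram, and `mask_spectrogram` to the spectrogram afterwards, with the band positions drawn at random for each training example.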

Funder

Ministry of Trade, Industry and Energy

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/11/23/3935/pdf

Reference: 39 articles.

1. Emotion recognition in human-computer interaction;Cowie;IEEE Signal Process. Mag.,2001

2. A Metaverse: Taxonomy, Components, Applications, and Open Challenges;Park;IEEE Access,2022

3. Emotion Communication System;Chen;IEEE Access,2016

4. Speech Emotion Recognition Using Deep Learning Techniques: A Review;Khalil;IEEE Access,2019

5. A Comprehensive Review of Speech Emotion Recognition Systems;Wani;IEEE Access,2021

Cited by 18 articles.

1. CNN-Based Models for Emotion and Sentiment Analysis Using Speech Data;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-08-08

2. Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition;Automatika;2024-07-03

3. Interpretable machine learning-based text classification method for construction quality defect reports;Journal of Building Engineering;2024-07

4. Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review;Speech Communication;2024-07

5. Enhancing the acoustic emission technique using fuzzy artificial bee colony-based deep learning for characterizing selective laser melted AlSi10Mg specimens;International Journal of Damage Mechanics;2024-04-30