Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition

Published:2020-09-28 Issue:19 Volume:20 Page:5559
ISSN:1424-8220
Container-title:Sensors
language:en

Author:

Seo Minji, Kim Myungho

Abstract

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. SER methods lose accuracy when they are trained and tested on different datasets. Cross-corpus SER research addresses this by recognizing speech emotion across different corpora and languages, and recent work has focused on improving generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules, on log-mel spectrograms of the source dataset. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in an image, the BOVW, which constructs a frequency histogram of visual words, helps the VACNN learn both global and local features of the log-mel spectrogram. The proposed method achieves overall accuracies of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. These results on RAVDESS, EmoDB, and SAVEE represent improvements of 7.73%, 15.12%, and 2.34%, respectively, over existing state-of-the-art cross-corpus SER approaches.
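
The BOVW step in the abstract (a frequency histogram of visual words over a spectrogram image) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which local descriptor or clustering method the paper uses, so this sketch assumes raw pixel patches as descriptors and a small hand-written k-means for the codebook. The function names (`extract_patches`, `kmeans`, `bovw_histogram`), the patch size, and the random toy "spectrogram" are all hypothetical.

```python
import random

def extract_patches(image, size, stride):
    """Flatten every size x size patch of a 2D list into one descriptor.
    Assumption: raw patches stand in for a real keypoint descriptor."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            patches.append([image[r + i][c + j]
                            for i in range(size) for j in range(size)])
    return patches

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(desc, codebook):
    """Index of the visual word (cluster center) closest to desc."""
    return min(range(len(codebook)), key=lambda k: sq_dist(desc, codebook[k]))

def kmeans(descs, k, iters=10, seed=0):
    """Plain k-means; the resulting centers act as the visual vocabulary."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descs, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descs:
            groups[nearest(d, centers)].append(d)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bovw_histogram(image, codebook, size, stride):
    """Normalized frequency histogram of visual words for one image."""
    patches = extract_patches(image, size, stride)
    counts = [0] * len(codebook)
    for d in patches:
        counts[nearest(d, codebook)] += 1
    return [c / len(patches) for c in counts]

# Toy usage: a random 16x16 array stands in for a log-mel spectrogram.
rng = random.Random(1)
toy = [[rng.random() for _ in range(16)] for _ in range(16)]
codebook = kmeans(extract_patches(toy, 4, 4), k=3)
hist = bovw_histogram(toy, codebook, 4, 4)
print(hist)
```

The histogram has one bin per visual word and sums to 1, so it can serve as a fixed-length feature vector regardless of the input image size. In a setup like the one the abstract describes, such a vector would be supplied alongside the CNN's features; how the paper combines the two is not specified here.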

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/20/19/5559/pdf

Cited by 27 articles.

1. Unveiling hidden factors: explainable AI for feature boosting in speech emotion recognition;Applied Intelligence;2024-05-31

2. Deep neural network architectures for audio emotion recognition performed on song and speech modalities;International Journal of Speech Technology;2023-12

3. Cross Corpus Speech Emotion Recognition using transfer learning and attention-based fusion of Wav2Vec2 and prosody features;Knowledge-Based Systems;2023-10

4. A medical text classification approach with ZEN and capsule network;The Journal of Supercomputing;2023-09-13

5. Study of Speech Emotion Recognition Using Blstm with Attention;2023 31st European Signal Processing Conference (EUSIPCO);2023-09-04