An enhanced speech emotion recognition using vision transformer-Reference-Cited by-同舟云学术

An enhanced speech emotion recognition using vision transformer

Published:2024-06-07 Issue:1 Volume:14 Page:
ISSN:2045-2322
Container-title:Scientific Reports
language:en
Short-container-title:Sci Rep

Author:

Akinpelu Samson,Viriri Serestina,Adegun Adekanmi

Abstract

AbstractIn human–computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users’ emotions. In the past, SER has significantly emphasised acoustic properties extracted from speech signals. The use of visual signals for enhancing SER performance, however, has been made possible by recent developments in deep learning and computer vision. This work utilizes a lightweight Vision Transformer (ViT) model to propose a novel method for improving speech emotion recognition. We leverage the ViT model’s capabilities to capture spatial dependencies and high-level features in images which are adequate indicators of emotional states from mel spectrogram input fed into the model. To determine the efficiency of our proposed approach, we conduct a comprehensive experiment on two benchmark speech emotion datasets, the Toronto English Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results of our extensive experiment demonstrate a considerable improvement in speech emotion recognition accuracy attesting to its generalizability as it achieved 98%, 91%, and 93% (TESS-EMODB) accuracy respectively on the datasets. The outcomes of the comparative experiment show that the non-overlapping patch-based feature extraction method substantially improves the discipline of speech emotion recognition. Our research indicates the potential for integrating vision transformer models into SER systems, opening up fresh opportunities for real-world applications requiring accurate emotion recognition from speech compared with other state-of-the-art techniques.

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41598-024-63776-4.pdf

Reference86 articles.

1. Alsabhan, W. Human-computer interaction with a real-time speech emotion recognition with ensembling techniques 1d. Sensors (Switzerland) 23(1386), 1–21. https://doi.org/10.3390/s2303138 (2023).

2. Yahia, A. C., Moussaoui, Frahta, N. & Moussaoui, A. Effective speech emotion recognition using deep learning approaches for Algerian Dialect. In In Proc. Intl. Conf. of Women in Data Science at Taif University, WiDSTaif 1–6 (2021). https://doi.org/10.1109/WIDSTAIF52235.2021.9430224

3. Blackwell, A. Human Computer Interaction-Lecture Notes Cambridge Computer Science Tripos, Part II. https://www.cl.cam.ac.uk/teaching/1011/HCI/HCI2010.pdf (2010)

4. Muthusamy, K. H., Polat, Yaacob, S. Improved emotion recognition using gaussian mixture model and extreme learning machine in speech and glottal signals. Math. Probl. Eng. (2015). https://doi.org/10.1155/2015/394083

5. Xie, J., Zhu, M. & Hu, K. Fusion-based speech emotion classification using two-stage feature selection. Speech Commun. 66(6), 102955. https://doi.org/10.1016/j.specom.2023.102955 (2023).

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Discriminative feature learning based on multi-view attention network with diffusion joint loss for speech emotion recognition;Engineering Applications of Artificial Intelligence;2024-11

2. Multi-Label Emotion Recognition of Korean Speech Data Using Deep Fusion Models;Applied Sciences;2024-08-28