Speech Emotion Recognition Using Convolutional Neural Networks and Multi-Head Convolutional Transformer

Authors:

Rizwan Ullah 1, Muhammad Asif 2, Wahab Ali Shah 3, Fakhar Anjam 2, Ibrar Ullah 4, Tahir Khurshaid 5, Lunchakorn Wuttisittikulkij 1, Shashi Shah 1, Syed Mansoor Ali 6, Mohammad Alibakhshikenari 7

Affiliation:

1. Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand

2. Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan

3. Department of Electrical Engineering, Namal University, Mianwali 42250, Pakistan

4. Department of Electrical Engineering, Kohat Campus, University of Engineering and Technology Peshawar, Kohat 25000, Pakistan

5. Department of Electrical Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea

6. Department of Physics and Astronomy, College of Science, King Saud University, P.O. Box 2455, Riyadh 11451, Saudi Arabia

7. Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, 28911 Madrid, Spain

Abstract

Speech emotion recognition (SER) is a challenging task in human–computer interaction (HCI) systems. One of the key challenges in SER is extracting emotional features effectively from a speech utterance. Despite the promising results of recent studies, they generally do not leverage advanced fusion algorithms to generate effective representations of emotional features in speech utterances. To address this problem, we describe the fusion of spatial and temporal feature representations of speech emotion by parallelizing convolutional neural networks (CNNs) and a Transformer encoder for SER. We stack two parallel CNNs for spatial feature representation alongside a Transformer encoder for temporal feature representation, thereby simultaneously expanding the filter depth and reducing the feature map, which yields an expressive hierarchical feature representation at a lower computational cost. We use the RAVDESS dataset to recognize eight different speech emotions. To minimize model overfitting, we augment the RAVDESS dataset with Additive White Gaussian Noise (AWGN), increasing the variation in the training data. With the spatial and sequential feature representations of the CNNs and the Transformer, the SER model achieves 82.31% accuracy for eight emotions on a hold-out set. In addition, the SER system is evaluated on the IEMOCAP dataset and achieves 79.42% recognition accuracy for five emotions. Experimental results on the RAVDESS and IEMOCAP datasets show the effectiveness of the presented SER system and demonstrate an absolute performance improvement over state-of-the-art (SOTA) models.
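
A minimal PyTorch sketch of the parallel CNN plus Transformer-encoder fusion described in the abstract may help make the architecture concrete. This is not the authors' released code; all layer sizes, kernel sizes, pooling choices, and input shapes are illustrative assumptions.

import torch
import torch.nn as nn

class ParallelCNNTransformerSER(nn.Module):
    def __init__(self, n_mels=128, d_model=128, n_classes=8):
        super().__init__()
        # Two parallel CNN branches: filter depth grows while max-pooling
        # shrinks the feature map (hierarchical spatial features).
        def conv_branch(k):
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 32)
            )
        self.cnn_a = conv_branch(3)   # smaller receptive field
        self.cnn_b = conv_branch(7)   # larger receptive field
        # Transformer encoder over the frame axis for temporal features.
        self.proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Fusion by concatenation, then classification into the emotions.
        self.classifier = nn.Linear(32 + 32 + d_model, n_classes)

    def forward(self, x):  # x: (B, 1, n_mels, n_frames) spectrogram
        spat = torch.cat([self.cnn_a(x), self.cnn_b(x)], dim=1)
        seq = self.proj(x.squeeze(1).transpose(1, 2))  # (B, n_frames, d_model)
        temp = self.encoder(seq).mean(dim=1)           # average over time
        return self.classifier(torch.cat([spat, temp], dim=1))

model = ParallelCNNTransformerSER()
logits = model(torch.randn(4, 1, 128, 256))  # -> (4, 8), one logit per emotion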
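
Similarly, a minimal sketch of the AWGN augmentation step, assuming a NumPy waveform and a target signal-to-noise ratio; the function name and the 15 dB default are illustrative assumptions, not values taken from the paper.

import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Return `signal` with additive white Gaussian noise at `snr_db`."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))  # SNR(dB) = 10*log10(Ps/Pn)
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

noisy = add_awgn(np.random.randn(16000))  # e.g., one second of audio at 16 kHz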

Funder

Second Century Fund (C2F), Chulalongkorn University

Universidad Carlos III de Madrid

European Union’s Horizon 2020

King Saud University

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry


Cited by 2 articles.
