End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

Published: 2021-05-12 Issue: 1 Volume: 2021
ISSN: 1687-4722
Container-title: EURASIP Journal on Audio, Speech, and Music Processing
Language: en
Short-container-title: J AUDIO SPEECH MUSIC PROC.

Author:

Tang Duowei, Kuppens Peter, Geurts Luc, van Waterschoot Toon

Abstract

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, avoiding the inherent lack of parallelisability of recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies, hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
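The abstract's point about receptive field can be illustrated with a minimal sketch. It is not the authors' architecture: the kernel size of 2, the doubling dilation schedule, and the helper names `receptive_field` and `dilated_causal_conv` are assumptions chosen only to show how stacking dilated causal convolutions widens the receptive field while each layer stays cheap and parallelisable over time steps.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    # Each layer extends the visible past by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_causal_conv(x, weights, dilation):
    """1-D dilated causal convolution with implicit left zero-padding.

    y[t] = sum_k weights[k] * x[t - k * dilation], where x[i] = 0 for i < 0,
    so the output at time t never depends on future samples.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed schedule: dilation doubles per layer (1, 2, 4, ..., 512).
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1 + (1 + 2 + ... + 512) = 1024

# Push a unit impulse through three layers to see how far it spreads.
signal = [1.0] + [0.0] * 15
for d in [1, 2, 4]:
    signal = dilated_causal_conv(signal, [1.0, 1.0], d)
print(signal)  # eight ones followed by eight zeros
```

Under these assumptions the receptive field grows exponentially with depth while the parameter count grows only linearly, which is the general property the abstract relies on to cover long input sequences without recurrent layers.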

Funder

China Scholarship Council

Onderzoeksraad, KU Leuven

European Research Council

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering,Acoustics and Ultrasonics

Link

https://link.springer.com/content/pdf/10.1186/s13636-021-00208-5.pdf

Reference: 49 articles.

1. J. A. Russell, Core affect and the psychological construction of emotion. Psychol. Rev. 110(1), 145–172 (2003).

2. F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, K. P. Truong, The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2016).

3. F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J. P. Thiran, T. Ebrahimi, D. Lalanne, B. Schuller, Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognit. Lett. 66, 22–30 (2015).

4. B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, in Proc. 2009 IEEE Work. Autom. Speech Recognit. Understanding (ASRU 2009). Acoustic emotion recognition: a benchmark comparison of performances (IEEE, Merano, 2009), pp. 552–557.

5. Z. Liu, M. Wu, W. Cao, J. Mao, J. Xu, G. Tan, Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273, 271–280 (2018).

Cited by 16 articles.

1. Beyond superficial emotion recognition: Modality-adaptive emotion recognition system; Expert Systems with Applications; 2024-01

2. Hilbert Domain Analysis of Wavelet Packets for Emotional Speech Classification; Circuits, Systems, and Signal Processing; 2023-12-06

3. A Multiscale Dynamic Temporal Convolution Network For Continuous Dimensional Emotion Recognition; 2023 International Joint Conference on Neural Networks (IJCNN); 2023-06-18

4. Recommendation of Music Based on Facial Emotion using Machine Learning Technique; Advances in Computational Intelligence in Materials Science; 2023-06-07

5. Paralinguistic and spectral feature extraction for speech emotion classification using machine learning techniques; EURASIP Journal on Audio, Speech, and Music Processing; 2023-05-15