Affiliation:
1. Agurchand Manmull Jain College, Meenambakkam, Chennai, India
Abstract
Speech Emotion Recognition (SER) is critical in Human-Computer Interaction (HCI) because it provides a deeper understanding of the user's state and leads to better engagement. Various machine learning and Deep Learning (DL) methods have been developed over the past decade to improve SER. In this research, we evaluate speech features and then present SpeechFormer++, a comprehensive structure-based framework for paralinguistic speech processing. Following the component relationships in the speech signal, we propose a unit encoder to efficiently model intra- and inter-unit information (i.e., frames, phones, and words). Merging blocks then generate features at different granularities, in keeping with the hierarchical structure of the speech signal. Rather than extracting spatiotemporal information from hand-crafted features, we investigate how to represent the temporal patterns of speech emotions using dynamic temporal scales. To that end, we present the Temporal-aware bI-direction Multi-scale Network (TIM-Net), a novel temporal emotional modelling approach for SER that learns multi-scale contextual affective representations from different time scales. An Unweighted Accuracy (UA) of 65.20% and a Weighted Accuracy (WA) of 78.29% are achieved using signal features in low- and high-level descriptions, together with various deep neural networks and machine learning approaches.
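To make the bi-directional multi-scale temporal modelling idea concrete, the listing below gives a minimal PyTorch sketch: a stack of increasingly dilated 1-D convolutions produces features at different temporal scales, the same stack is run over the time-reversed sequence, and the per-scale summaries are fused with learned weights. All class names (TemporalAwareBlock, BiDirectionMultiScale), hyper-parameters, and the pooling and fusion scheme are illustrative assumptions, not the published TIM-Net implementation.

import torch
import torch.nn as nn

class TemporalAwareBlock(nn.Module):
    # One dilated 1-D conv block; stacking these grows the receptive
    # field exponentially, so each block covers a wider temporal scale.
    # (Names and settings are illustrative, not the authors' code.)
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # keep sequence length
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, channels, time)
        return self.act(self.conv(x)) + x    # residual connection

class BiDirectionMultiScale(nn.Module):
    # Runs the dilated stack over the sequence and its time-reversed
    # copy, pools each block's output over time, and fuses the
    # per-scale summaries with softmax-normalised learned weights.
    def __init__(self, channels, num_scales=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            TemporalAwareBlock(channels, dilation=2 ** i)
            for i in range(num_scales))
        self.scale_weights = nn.Parameter(torch.ones(num_scales))

    def _run(self, x):
        summaries = []
        for block in self.blocks:
            x = block(x)
            summaries.append(x.mean(dim=-1))  # pool over the time axis
        return summaries                      # list of (batch, channels)

    def forward(self, x):                     # x: (batch, channels, time)
        fwd = self._run(x)
        bwd = self._run(torch.flip(x, dims=[-1]))  # reversed direction
        w = torch.softmax(self.scale_weights, dim=0)
        fused = sum(w[i] * (fwd[i] + bwd[i]) for i in range(len(fwd)))
        return fused                          # multi-scale utterance embedding

# Example: 39-dim frame features (e.g., MFCCs) for 8 utterances of 300 frames
feats = torch.randn(8, 39, 300)
model = BiDirectionMultiScale(channels=39)
emb = model(feats)                            # (8, 39) embedding for a classifier

The fused embedding would then feed a standard classification head; the time-reversed pass is what makes the model "bi-direction", letting each scale summarise both past-to-future and future-to-past context.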