Affiliations:
1. College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China
2. Hubei Province Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan 430081, China
Abstract
Emotion recognition in conversations (ERC), which aims to capture the dynamic changes in emotion over the course of a conversation, has recently attracted considerable attention due to its importance in providing engaging and empathetic services. Because unimodal ERC approaches struggle to capture emotional shifts in conversations, multimodal ERC research is on the rise. However, existing multimodal approaches still suffer from the following limitations: (1) they fail to fully explore richer multimodal interactions and fusion; (2) they fail to dynamically model speaker-dependent context in conversations; and (3) they fail to employ model-agnostic techniques to eliminate semantic gaps among different modalities. We therefore propose a novel hierarchical cross-modal interaction and fusion network enhanced with self-distillation (HCIFN-SD) for ERC. Specifically, HCIFN-SD first applies three different mask strategies to extract speaker-dependent cross-modal conversational context based on an enhanced GRU module. Then, the graph-attention-based multimodal fusion (MF-GAT) module constructs three directed graphs to represent the different modality spaces, performs in-depth cross-modal interactions to propagate conversational context, and introduces a new GNN layer to mitigate over-smoothing. Finally, self-distillation transfers knowledge from both hard and soft labels to supervise the training of each student classifier, eliminating semantic gaps between modalities and improving the quality of the fused multimodal representation. Extensive experiments on IEMOCAP and MELD demonstrate that HCIFN-SD outperforms mainstream state-of-the-art baselines by a significant margin.
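As a rough illustration of the self-distillation objective sketched in the abstract, the following PyTorch snippet combines hard-label supervision (cross-entropy against ground-truth emotions) with soft-label knowledge transfer (KL divergence toward the fused classifier's distribution). This is a minimal sketch under our own assumptions, not the paper's implementation: the function name, the teacher/student pairing, and the `temperature` and `alpha` hyperparameters are all hypothetical.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, hard_labels,
                           temperature=2.0, alpha=0.5):
    """Supervise a per-modality student classifier with both hard and soft labels.

    student_logits: outputs of one modality's classifier, shape (batch, num_classes)
    teacher_logits: outputs of the fused multimodal classifier, same shape
    hard_labels:    ground-truth emotion labels, shape (batch,)
    temperature, alpha: illustrative hyperparameters (not taken from the paper)
    """
    # Hard-label term: standard cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-label term: KL divergence between temperature-scaled distributions;
    # the teacher's probabilities act as soft labels for the student.
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Weighted combination of the two supervision signals.
    return alpha * ce + (1.0 - alpha) * kl
```

In a setup like this, one such loss term would be computed for each modality-specific student and summed with the main task loss, so that every student is pulled toward the fused representation while still fitting the hard labels.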
Funder
National Natural Science Foundation of China