Abstract
Emotion recognition in dialogues is crucial for building empathetic machines. Current research focuses primarily on learning emotion-related common features from multimodal data, but it does not adequately model the various dependencies among emotional features within a dialogue. This oversight can lower the accuracy of multimodal emotion recognition and prevent emotions from being recognized in real time. To address this problem, we propose CRRGM, a contextualized approach that uses an enhanced Relational Graph Attention Network (RGAT) and GraphTransformer for multimodal emotion recognition. The model first employs a Transformer to capture global information across modalities. It then constructs a heterogeneous graph from the extracted global features and applies the enhanced RGAT and GraphTransformer to model the complex dependencies in a conversation. Finally, a reinforcement learning algorithm is used to enable real-time emotion recognition. Extensive experiments on two benchmark datasets indicate that CRRGM achieves state-of-the-art performance.