Affiliation:
1. School of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
2. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
3. Virtual Reality Research Institute, Beihang University Qingdao Research Institute, Qingdao, China
4. School of Software, Beihang University, Beijing, China
Abstract
Personality recognition is of great significance in deepening the understanding of social relations. Although personality recognition methods have made significant strides in recent years, heterogeneity between modalities during feature fusion remains an open challenge. This paper introduces an adaptive multi‐modal information fusion network (AMIF‐Net) capable of concurrently processing video, audio, and text data. First, the AMIF‐Net encoder processes the extracted audio and video features separately, effectively capturing long‐term data relationships. Then, adaptive elements added to the fusion network alleviate the heterogeneity between modalities. Lastly, the audio‐video and text features are concatenated and fed into a regression network to obtain Big Five personality trait scores. Furthermore, we introduce a novel loss function to address the problem of training inaccuracies, taking advantage of its unique property of exhibiting a peak at the critical mean. Our tests on the ChaLearn First Impressions V2 multi‐modal dataset show that AMIF‐Net surpasses state‐of‐the‐art networks on some performance measures.
Funder
National Natural Science Foundation of China
National Science and Technology Major Project
Natural Science Foundation of Beijing Municipality
Shandong University of Science and Technology