Abstract
Big Five personality classification models often rely on capturing users' facial expressions or other private data. However, in real-life scenarios, individuals may not want their facial expressions recorded due to concerns about accidental data leakage. Furthermore, speech-based personality classification models face new challenges in real-life multilingual environments. We developed a multimodal Big Five personality classification model for multilingual environments that relies solely on speech as input. Combining paralinguistic cues from speech with semantic content from transcribed text provides sufficient signal for predicting personality tendencies. The model uses the large-scale multilingual pre-trained models Emotion2vec and BERT to process the speech and text modalities, respectively. The model is trained on the First Impressions monolingual speech dataset and then fine-tuned on a multilingual real-world dataset containing live-stream clips of 512 virtual anchors. It achieves 60.13% and 52.40% accuracy on the two datasets, respectively, in low-resource scenarios. Furthermore, as audio length increases, accuracy improves to as much as 68.86% in real-life scenarios, suggesting potential for developing streaming personality classification models in the future. Personality monitoring has a wide range of applications, including assisting healthcare professionals in personalizing treatment plans and, in consumer psychology, helping businesses analyze audience segments.
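To make the fusion of the two modalities concrete, below is a minimal PyTorch sketch of a late-fusion classification head, assuming the speech embedding comes from Emotion2vec and the text embedding from BERT. The class name, layer sizes, and dropout rate are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class BigFiveFusionClassifier(nn.Module):
    """Hypothetical late-fusion head: concatenates a speech embedding
    (e.g., from Emotion2vec) with a text embedding (e.g., from BERT)
    and predicts scores for the five personality traits."""

    def __init__(self, speech_dim=768, text_dim=768, hidden_dim=256, num_traits=5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(speech_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_traits),
        )

    def forward(self, speech_emb, text_emb):
        # speech_emb: (batch, speech_dim); text_emb: (batch, text_dim)
        fused = torch.cat([speech_emb, text_emb], dim=-1)
        return torch.sigmoid(self.fuse(fused))  # per-trait scores in [0, 1]


# Usage with dummy tensors standing in for Emotion2vec / BERT outputs:
model = BigFiveFusionClassifier()
scores = model(torch.randn(2, 768), torch.randn(2, 768))
print(scores.shape)  # torch.Size([2, 5])
```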