AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition

Authors:

Avishek Das 1, Moumita Sen Sarma 1, Mohammed Moshiul Hoque 1, Nazmul Siddique 2, M. Ali Akber Dewan 3

Affiliations:

1. Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh

2. School of Computing, Engineering and Intelligent Systems, Ulster University, Belfast BT15 1AP, UK

3. School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Athabasca, AB T9S 3A3, Canada

Abstract

Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). Comprising 1002 samples across audio, video, and text modalities, it is a unique resource for emotion recognition studies in the Bangla language and covers emotional categories such as anger, fear, joy, and sadness. We also developed AVaTER, a framework for audio, video, and textual emotion recognition that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model’s ability to capture nuanced emotional cues. The approach achieved an F1-score of 0.64, a significant improvement over unimodal methods.
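To make the fusion idea concrete, the following is a minimal, hypothetical sketch of cross-modal attention between unimodal feature sequences in PyTorch. The feature dimensions, the CrossModalAttention module, the mean-pooling step, and the classifier head are illustrative assumptions for a generic query/key-value attention fusion, not the authors' published AVaTER implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block (assumed design, not the
    paper's exact architecture): queries come from one modality while keys
    and values come from another, so each modality can attend to emotional
    cues in the other."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return fused

# Toy example: fuse text features with audio and video features, then
# classify into four emotion classes (anger, fear, joy, sadness).
dim = 256
text = torch.randn(8, 20, dim)   # (batch, text tokens, dim)
audio = torch.randn(8, 50, dim)  # (batch, audio frames, dim)
video = torch.randn(8, 30, dim)  # (batch, video frames, dim)

text_audio = CrossModalAttention(dim)
text_video = CrossModalAttention(dim)
classifier = nn.Linear(2 * dim, 4)

# Pool each cross-attended sequence over time and concatenate before classifying.
ta = text_audio(text, audio).mean(dim=1)   # text attending to audio
tv = text_video(text, video).mean(dim=1)   # text attending to video
logits = classifier(torch.cat([ta, tv], dim=-1))
print(logits.shape)  # torch.Size([8, 4])
```

The key design point this sketch illustrates is that, unlike simple feature concatenation, the attention weights let one modality selectively emphasize the parts of another modality that carry emotional signal.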

Funder

Directorate of Research and Extension (DRE), Chittagong University of Engineering & Technology

Publisher

MDPI AG

