Affiliation:
1. Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
2. Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information, Beijing University of Technology, Beijing, China
3. Peking University, Beijing, China
Abstract
Efficient emotion recognition has attracted extensive research interest, enabling new applications in fields such as human-computer interaction, disease diagnosis, and service robotics. Although existing sentiment-analysis work relying on sensors or unimodal methods performs well in simple contexts such as business recommendation and facial expression recognition, it falls far below expectations in complex scenes involving sarcasm, disdain, and metaphor. In this article, we propose a novel two-stage multimodal learning framework, called AMSA, that adaptively learns the correlation and complementarity between modalities for dynamic fusion, achieving more stable and precise sentiment-analysis results. Specifically, in the first stage, a multiscale attention model with a slice-positioning scheme is proposed to obtain stable sentiment quintuplets from images, text, and speech. In the second stage, a Transformer-based self-adaptive network is proposed to assign weights flexibly for multimodal fusion and to update the parameters of the loss function through compensation iteration. To quickly locate key areas for efficient affective computing, a patch-based selection scheme is proposed that iteratively removes redundant information through a novel loss function before fusion. Extensive experiments have been conducted on both machine weakly-labeled and manually annotated datasets: our self-built Video-SA as well as CMU-MOSEI and CMU-MOSI. The results demonstrate the superiority of our approach over the baselines.
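The abstract describes the second-stage fusion network only at a high level. As an illustration, the minimal sketch below shows one plausible way a Transformer encoder could produce adaptive softmax weights over image, text, and speech features before fusion. It assumes pre-extracted unimodal feature vectors and PyTorch; the class name AdaptiveFusion, the gating layer, and all dimension choices are hypothetical and are not the authors' implementation.

# Minimal sketch (not the AMSA code): Transformer-based adaptive fusion
# that learns per-sample softmax weights over image/text/speech features.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2, n_classes=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gate = nn.Linear(dim, 1)            # scores each modality token
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, img, txt, spc):
        # img/txt/spc: (batch, dim) pre-extracted unimodal features
        tokens = torch.stack([img, txt, spc], dim=1)   # (batch, 3, dim)
        tokens = self.encoder(tokens)                  # cross-modal mixing
        w = torch.softmax(self.gate(tokens), dim=1)    # (batch, 3, 1) weights
        fused = (w * tokens).sum(dim=1)                # weighted fusion
        return self.classifier(fused), w.squeeze(-1)

if __name__ == "__main__":
    model = AdaptiveFusion()
    b = 8
    logits, weights = model(torch.randn(b, 256), torch.randn(b, 256),
                            torch.randn(b, 256))
    print(logits.shape, weights.shape)  # torch.Size([8, 3]) torch.Size([8, 3])

Treating each modality as a token lets self-attention model cross-modal correlation and complementarity, while the softmax gate realizes the flexible weight assignment the abstract describes; the paper's compensation iteration over the loss parameters is not sketched here.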
Funder
Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by 7 articles.
1. Complementary information mutual learning for multimodality medical image segmentation. Neural Networks, September 2024.
2. Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis. ACM Transactions on Multimedia Computing, Communications, and Applications, 11 January 2024.
3. Multimodal Sentiment Analysis for Personality Prediction. 2023 International Conference on Frontiers of Information Technology (FIT), 11 December 2023.
4. Attention U-net for Cell Instance Segmentation. 2023 China Automation Congress (CAC), 17 November 2023.
5. Broad Learning System Based on Fusion Features. Communications in Computer and Information Science, 5 November 2023.