1. J. Lu. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems.
2. H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In 35th Conference on Neural Information Processing Systems.
3. A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark. 2021. Learning transferable visual models from natural language supervision. In 38th International Conference on Machine Learning.
4. A deep learning architecture of RA-DLNet for visual sentiment analysis.
5. Sentiment analysis in medical settings: New opportunities and challenges.