Abstract
Visual sentiment analysis is in great demand, as it provides a computational means of recognizing sentiment information in the abundant visual content on social media sites. Most existing methods use CNNs to extract varied visual attributes for image sentiment prediction, but they fail to comprehensively consider the correlations among visual components and are consequently limited by the receptive field of convolutional layers. In this work, we propose the visual semantic correlation network (VSCNet), a Transformer-based visual sentiment prediction model. Specifically, global visual features are captured through an extended attention network built by stacking well-designed, Transformer-like extended attention blocks. An off-the-shelf object query tool is used to determine local candidates for potential affective regions, filtering out redundant and noisy visual proposals. All candidates deemed affective are embedded into a computable semantic space. Finally, a fusion strategy integrates the semantic representations with the visual features for sentiment analysis. Extensive experiments show that our method outperforms previous studies on five annotated public image sentiment datasets without any training tricks; in particular, it achieves 1.8% higher accuracy on the FI benchmark than other state-of-the-art methods.
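To make the two-branch pipeline described above concrete, the following is a minimal PyTorch sketch. The layer sizes, the use of nn.TransformerEncoder as a stand-in for the paper's extended attention blocks, the mean-pooling, and the simple concatenation fusion are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VSCNetSketch(nn.Module):
    def __init__(self, patch_dim=768, sem_dim=300, num_classes=2,
                 num_heads=8, num_layers=4):
        super().__init__()
        # Global branch: stacked self-attention over image patch embeddings
        # (a stand-in for the extended attention network).
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=num_heads, batch_first=True)
        self.global_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Local branch: project region semantics (e.g. embeddings of object
        # labels returned by an off-the-shelf detector) into a shared space.
        self.semantic_proj = nn.Linear(sem_dim, patch_dim)
        # Fusion: concatenate pooled global and semantic features, then classify.
        self.classifier = nn.Linear(2 * patch_dim, num_classes)

    def forward(self, patches, region_semantics):
        # patches:          (B, N_patches, patch_dim) image patch embeddings
        # region_semantics: (B, N_regions, sem_dim) embeddings of the affective
        #                   region candidates that survive proposal filtering
        g = self.global_encoder(patches).mean(dim=1)          # (B, patch_dim)
        s = self.semantic_proj(region_semantics).mean(dim=1)  # (B, patch_dim)
        return self.classifier(torch.cat([g, s], dim=-1))     # (B, num_classes)

# Toy usage with random tensors standing in for real patch/region features.
model = VSCNetSketch()
logits = model(torch.randn(2, 196, 768), torch.randn(2, 5, 300))
print(logits.shape)  # torch.Size([2, 2])
```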
Funder
National Natural Science Foundation of China
Yunnan Province Ten Thousand Talents Program and Yunling Scholars Special Project
Publisher
Springer Science and Business Media LLC
Subject
Computational Mathematics, Engineering (miscellaneous), Information Systems, Artificial Intelligence
Cited by
1 article.