Authors:
Papadopoulos Stefanos-Iordanis, Koutlis Christos, Papadopoulos Symeon, Petrantonakis Panagiotis C.
Abstract
Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation (MM) and the urgent need for effective strategies to detect and prevent its spread. In recent years, the challenge of multimodal misinformation detection (MMD) has garnered significant attention from researchers and has mainly involved the creation of annotated, weakly annotated, or synthetically generated training datasets, along with the development of various deep learning MMD models. However, the problem of unimodal bias has been overlooked: specific patterns and biases in MMD benchmarks can result in biased or unimodal models outperforming their multimodal counterparts on an inherently multimodal task, making it difficult to assess progress. In this study, we systematically investigate and identify the presence of unimodal bias in widely used MMD benchmarks, namely VMU-Twitter and COSMOS. To address this issue, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark for MMD, which incorporates real-world data, excludes "asymmetric multimodal misinformation", and utilizes "modality balancing". We conduct an extensive comparative study with a transformer-based architecture that shows the ability of VERITE to effectively address unimodal bias, rendering it a robust evaluation framework for MMD. Furthermore, we introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data that preserve crossmodal relations between legitimate images and false human-written captions. By leveraging CHASMA in the training process, we observe consistent and notable improvements in predictive performance on VERITE, with a 9.2% increase in accuracy. We release our code at: https://github.com/stevejpapad/image-text-verification
Funder
Centre for Research & Technology Hellas
Publisher
Springer Science and Business Media LLC
References (55 articles)
1. Abdelnabi S, Hasan R, Fritz M (2022) Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14940–14949
2. Agrawal A, Batra D, Parikh D et al (2018) Don't just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4971–4980
3. Alam F, Cresci S, Chakraborty T et al (2022) A survey on multimodal disinformation detection. In: Proceedings of the 29th international conference on computational linguistics, international committee on computational linguistics, pp 6625–6643
4. Aneja S, Bregler C, Nießner M (2023) COSMOS: catching out-of-context image misuse using self-supervised learning. In: Proceedings of the AAAI conference on artificial intelligence, pp 14084–14092
5. Aneja S, Midoglu C, Dang-Nguyen DT et al (2021) MMSys'21 grand challenge on detecting cheapfakes. arXiv preprint arXiv:2107.05297
Cited by
1 article.