Affiliation:
1. Communication University of China, China
Abstract
With the continuous advancement of deepfake technology, there has been a surge in the creation of realistic fake videos. Unfortunately, the malicious use of deepfakes poses a significant threat to societal morality and political security, and numerous researchers have therefore proposed deepfake detection methods. However, traditional detection approaches tend to focus on specific forgery features, such as artifacts or inconsistent actions, which makes them vulnerable to specialized countermeasures. Recent studies show an intrinsic correlation between facial and audio cues that can be exploited for deepfake detection. To address these challenges and improve the robustness and generalization of deepfake detection algorithms, we propose a novel joint audio-visual deepfake detection model named AVA-CL, which detects deepfakes in both the audio and visual domains. By exploiting the inherent correlation and consistency between the audio and visual modalities, AVA-CL significantly improves detection effectiveness. Through extensive experiments, we demonstrate that our proposed AVA-CL model outperforms many state-of-the-art (SOTA) methods, with superior robustness and generalization capabilities. This research presents a promising approach to detecting deepfakes and reducing the harm caused by their malicious use.
Funder
National Key Research and Development Program
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications,Hardware and Architecture
Cited by
2 articles.