Temporal Feature Prediction in Audio–Visual Deepfake Detection
-
Published: 2024-08-29
Issue: 17
Volume: 13
Page: 3433
-
ISSN: 2079-9292
-
Container-title: Electronics
-
Language: en
-
Short-container-title: Electronics
Author:
Gao Yuan 1,2, Wang Xuelong 1, Zhang Yu 1,3, Zeng Ping 1,3, Ma Yingjie 1
Affiliation:
1. Department of Electronics and Communications Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
2. State Information Center, Beijing 100045, China
3. School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Abstract
The rapid growth of deepfake technology, which generates realistic manipulated media, poses a significant threat due to its potential for misuse. Effective detection methods are therefore urgently needed, yet current approaches often focus on a single modality or on a simple fusion of audio–visual signals, which limits their accuracy. To address this problem, we propose a deepfake detection scheme based on bimodal temporal feature prediction, which introduces the idea of temporal feature prediction into the audio–video bimodal deepfake detection task in order to fully exploit the temporal regularities of the audio and visual modalities. First, pairs of adjacent audio–video sequence clips are used to construct input quadruples, and a dual-stream network extracts temporal feature representations from the video and audio streams, respectively. A video prediction module and an audio prediction module then capture the temporal inconsistencies within each modality by predicting future temporal features and comparing them with reference features. Finally, a projection layer network aligns the audio–visual features, using contrastive loss functions to perform contrastive learning and maximize the difference between real and fake videos across the two modalities. Experiments on the FakeAVCeleb dataset demonstrate superior performance, with an accuracy of 84.33% and an AUC of 89.91%, outperforming existing methods and confirming the effectiveness of our approach to deepfake detection.
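The abstract describes three components: per-modality temporal feature prediction, a comparison of predicted features against reference features from the adjacent clip, and a projection layer trained with a contrastive loss to align the two modalities. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; every name, dimension, and loss form here (TemporalPredictor, feat_dim, the margin-based contrastive loss) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch of bimodal temporal feature prediction; all module
# names, dimensions, and loss weights are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPredictor(nn.Module):
    """Predict the next clip's features from the current clip's feature
    sequence, so prediction error can expose temporal inconsistencies."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        out, _ = self.gru(feats)
        return self.head(out[:, -1])          # predicted next-clip features

def prediction_loss(predicted, reference):
    # Compare predicted future features with the reference features
    # actually extracted from the adjacent clip.
    return F.mse_loss(predicted, reference)

class Projection(nn.Module):
    """Map audio and video features into a shared space for alignment."""
    def __init__(self, feat_dim=512, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, proj_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(v, a, labels, margin=1.0):
    # Pull matched (real) audio-video pairs together, push manipulated
    # pairs apart; labels is 1.0 for real pairs and 0.0 for fake ones.
    dist = (v - a).pow(2).sum(dim=-1)
    hinge = F.relu(margin - dist.clamp_min(1e-12).sqrt()).pow(2)
    return (labels * dist + (1 - labels) * hinge).mean()

# Toy forward pass over one input quadruple of pre-extracted features:
# (video clip t, video clip t+1, audio clip t, audio clip t+1).
B, T, D = 4, 8, 512
v_t, v_next = torch.randn(B, T, D), torch.randn(B, T, D)
a_t, a_next = torch.randn(B, T, D), torch.randn(B, T, D)
labels = torch.randint(0, 2, (B,)).float()

v_pred = TemporalPredictor(D)(v_t)
a_pred = TemporalPredictor(D)(a_t)
proj = Projection(D)
loss = (prediction_loss(v_pred, v_next.mean(dim=1)) +
        prediction_loss(a_pred, a_next.mean(dim=1)) +
        contrastive_loss(proj(v_pred), proj(a_pred), labels))
```

The equal weighting of the three loss terms, and the use of a GRU with a mean-pooled reference, are arbitrary choices made to keep the sketch short; the paper's actual architecture and loss balancing may differ.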
Funder
Fundamental Research Funds for the Central Universities; China Postdoctoral Science Foundation; National Social Science Foundation of China