Temporal Feature Prediction in Audio–Visual Deepfake Detection
-
Published: 2024-08-29
Issue: 17
Volume: 13
Page: 3433
-
ISSN: 2079-9292
-
Container-title: Electronics
-
Language: en
-
Short-container-title: Electronics
Author:
Gao Yuan 1,2, Wang Xuelong 1, Zhang Yu 1,3, Zeng Ping 1,3, Ma Yingjie 1
Affiliation:
1. Department of Electronics and Communications Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
2. State Information Center, Beijing 100045, China
3. School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Abstract
The rapid growth of deepfake technology, which generates realistic manipulated media, poses a significant threat due to its potential for misuse. Effective detection methods are therefore urgently needed, yet current approaches often focus on a single modality or on a simple fusion of audio–visual signals, which limits their accuracy. To address this problem, we propose a deepfake detection scheme based on bimodal temporal feature prediction, which introduces the idea of temporal feature prediction into the audio–video bimodal deepfake detection task in order to fully exploit the temporal regularities of the audio and visual modalities. First, pairs of adjacent audio–video sequence clips are used to construct input quadruples, and a dual-stream network extracts temporal feature representations from the video and audio streams, respectively. A video prediction module and an audio prediction module then capture the temporal inconsistencies within each modality by predicting future temporal features and comparing them with reference features. Finally, a projection layer network aligns the audio–visual features, using contrastive loss functions to perform contrastive learning and maximize the difference between real and fake videos across the two modalities. Experiments on the FakeAVCeleb dataset demonstrate superior performance, with an accuracy of 84.33% and an AUC of 89.91%, outperforming existing methods and confirming the effectiveness of our approach to deepfake detection.
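The abstract describes three components: per-modality temporal feature prediction, a comparison of predicted features against reference features from the adjacent clip, and a projection layer trained with a contrastive loss to align the two modalities. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; every name, dimension, and loss form here (TemporalPredictor, feat_dim, the margin-based contrastive loss) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch of bimodal temporal feature prediction; all module
# names, dimensions, and loss weights are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPredictor(nn.Module):
    """Predict the next clip's features from the current clip's feature
    sequence, so prediction error can expose temporal inconsistencies."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        out, _ = self.gru(feats)
        return self.head(out[:, -1])          # predicted next-clip features

def prediction_loss(predicted, reference):
    # Compare predicted future features with the reference features
    # actually extracted from the adjacent clip.
    return F.mse_loss(predicted, reference)

class Projection(nn.Module):
    """Map audio and video features into a shared space for alignment."""
    def __init__(self, feat_dim=512, proj_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, proj_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(v, a, labels, margin=1.0):
    # Pull matched (real) audio-video pairs together, push manipulated
    # pairs apart; labels is 1.0 for real pairs and 0.0 for fake ones.
    dist = (v - a).pow(2).sum(dim=-1)
    hinge = F.relu(margin - dist.clamp_min(1e-12).sqrt()).pow(2)
    return (labels * dist + (1 - labels) * hinge).mean()

# Toy forward pass over one input quadruple of pre-extracted features:
# (video clip t, video clip t+1, audio clip t, audio clip t+1).
B, T, D = 4, 8, 512
v_t, v_next = torch.randn(B, T, D), torch.randn(B, T, D)
a_t, a_next = torch.randn(B, T, D), torch.randn(B, T, D)
labels = torch.randint(0, 2, (B,)).float()

v_pred = TemporalPredictor(D)(v_t)
a_pred = TemporalPredictor(D)(a_t)
proj = Projection(D)
loss = (prediction_loss(v_pred, v_next.mean(dim=1)) +
        prediction_loss(a_pred, a_next.mean(dim=1)) +
        contrastive_loss(proj(v_pred), proj(a_pred), labels))
```

The equal weighting of the three loss terms, and the use of a GRU with a mean-pooled reference, are arbitrary choices made to keep the sketch short; the paper's actual architecture and loss balancing may differ.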
Funder
Fundamental Research Funds for the Central Universities; China Postdoctoral Science Foundation; National Social Science Foundation of China