Refining Localized Attention Features with Multi-Scale Relationships for Enhanced Deepfake Detection in Spatial-Frequency Domain
Published: 2024-05-01
Journal: Electronics, Volume 13, Issue 9, Page 1749
ISSN: 2079-9292
Language: en
Authors:
Gao Yuan 1,2, Zhang Yu 1,3, Zeng Ping 1,3, Ma Yingjie 1
Affiliations:
1. Department of Electronics and Communications Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
2. State Information Center, Beijing 100045, China
3. School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
Abstract
The rapid advancement of deep learning and large-scale AI models has made deepfake technologies, which generate, edit, and replace faces in images and videos, easy to create and apply. This growing ease of use has turned the malicious use of forged faces into a significant threat and complicated the task of deepfake detection. Although current deepfake detection methods, which predominantly employ data-driven CNN classification models, have achieved notable success, they exhibit limited generalization and insufficient robustness against novel data unseen during training. To tackle these challenges, this paper introduces a novel detection framework, ReLAF-Net. The framework employs a restricted self-attention mechanism that applies self-attention to deep CNN features flexibly, facilitating the learning of local relationships and inter-regional dependencies at both fine-grained and global levels. The mechanism has a modular design and can be seamlessly integrated into CNN networks to improve overall detection performance. Additionally, we propose an adaptive local frequency feature extraction algorithm that decomposes RGB images into fine-grained frequency components in a data-driven manner, effectively isolating fake indicators in the frequency space. Moreover, an attention-based channel fusion strategy is developed to combine RGB and frequency information into a comprehensive facial representation. Tested on the high-quality version of the FaceForensics++ dataset, our method attained a detection accuracy of 97.92%, outperforming other approaches. Cross-dataset validation on Celeb-DF, DFDC, and DFD confirms its robust generalizability, offering a new solution for detecting high-quality deepfake videos.
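The attention-based channel fusion described above can be sketched in a minimal form. This is a hypothetical illustration, not the paper's actual module: it follows a common squeeze-and-excitation style recipe (global average pooling to per-channel descriptors, softmax to attention weights, weighted recombination of the RGB and frequency branches); the function name and shapes are assumptions for the example.

```python
import numpy as np

def channel_attention_fusion(rgb_feat, freq_feat):
    """Fuse RGB and frequency feature maps with channel attention.

    Hypothetical sketch of an attention-based channel fusion step:
    stack both branches, score each channel by its global average,
    turn the scores into softmax weights, and sum the reweighted
    branches back into one (C, H, W) representation.
    """
    # Concatenate the two branches along the channel axis: (2C, H, W)
    stacked = np.concatenate([rgb_feat, freq_feat], axis=0)
    # Squeeze: global average pool each channel to a scalar descriptor
    descriptor = stacked.mean(axis=(1, 2))            # shape (2C,)
    # Excitation: softmax over channels yields the fusion weights
    weights = np.exp(descriptor - descriptor.max())
    weights /= weights.sum()                          # sums to 1
    # Reweight every channel, then merge the branches back to (C, H, W)
    reweighted = stacked * weights[:, None, None]
    c = rgb_feat.shape[0]
    return reweighted[:c] + reweighted[c:]

# Usage: fuse two random 8-channel 16x16 feature maps
rgb = np.random.rand(8, 16, 16)
freq = np.random.rand(8, 16, 16)
fused = channel_attention_fusion(rgb, freq)
print(fused.shape)  # (8, 16, 16)
```

In a real network the channel weights would come from a small learned layer rather than a fixed softmax over pooled averages, but the data flow, pool, weight, reweight, merge, is the same.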
Funder
China Postdoctoral Science Foundation; National Social Science Fund of China; National Natural Science Foundation of China