Shallow and deep feature fusion for digital audio tampering detection

Published:2022-08-13 Issue:1 Volume:2022
ISSN:1687-6180
Container-title:EURASIP Journal on Advances in Signal Processing
language:en
Short-container-title:EURASIP J. Adv. Signal Process.

Author:

Wang Zhifeng, Yang Yao, Zeng Chunyan, Kong Shuai, Feng Shixiong, Zhao Nan

Abstract

Digital audio tampering detection can be used to verify the authenticity of digital audio. Most current methods either compare the continuity of the electric network frequency (ENF) in a recording visually against a standard ENF database, or extract features and classify them with machine learning. ENF databases are usually hard to obtain, visual methods have weak feature representation, and machine learning methods lose information during feature extraction, all of which lower detection accuracy. This paper proposes a fusion of shallow and deep features that makes fuller use of ENF information: features at different levels are complementary, so together they describe more accurately the inconsistencies that tampering introduces into the original audio. First, the audio signal is band-pass filtered to obtain the ENF component, and the discrete Fourier transform (DFT) and the Hilbert transform are applied to obtain the phase and instantaneous frequency of that component. Second, three kinds of features are derived from these sequences: the mean value of the sequence variation serves as the shallow feature; the feature matrix obtained by framing and reshaping the ENF sequence serves as the input to a convolutional neural network (CNN); and fitting-coefficient features are obtained by curve fitting. Third, the CNN extracts the local details of the ENF from the feature matrix, and a deep neural network (DNN) extracts the global information of the ENF from the fitting-coefficient features. The local and global information together make up the deep features of the ENF. The shallow and deep features are then fused with an attention mechanism that gives greater weight to features useful for classification and suppresses uninformative ones. Finally, a DNN containing two fully connected layers reduces the dimensionality of the fused features and fits them, and a Softmax layer classifies the audio as tampered or not.
The method achieves 97.03% accuracy on three classic databases: Carioca 1, Carioca 2, and New Spanish. It also achieves 88.31% accuracy on the newly constructed database GAUDI-DI. Experimental results show that the proposed method outperforms state-of-the-art methods.
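
The first stage described in the abstract (isolating the ENF component, then reading off its phase and instantaneous frequency) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a 50 Hz nominal grid frequency, a 0.5 Hz half-bandwidth, and an FFT-based analytic signal in place of the paper's separate DFT and Hilbert steps. It also reads "mean value of the sequence variation" as the mean absolute first difference. The function names `enf_analytic_signal`, `phase_and_frequency`, and `shallow_feature` are hypothetical.

```python
import numpy as np


def enf_analytic_signal(audio, fs, f0=50.0, half_band=0.5):
    """Band-pass around f0 and return the analytic signal of the ENF component.

    Assumption: both steps are done in the frequency domain. Keeping only the
    positive-frequency bins inside [f0 - half_band, f0 + half_band] and
    doubling them gives the analytic (Hilbert) signal of the band-passed audio.
    """
    n = len(audio)
    spectrum = np.fft.fft(audio)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    mask = (freqs >= f0 - half_band) & (freqs <= f0 + half_band)
    return np.fft.ifft(2.0 * spectrum * mask)


def phase_and_frequency(analytic, fs):
    """Unwrapped phase (rad) and instantaneous frequency (Hz) of the ENF."""
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    return phase, inst_freq


def shallow_feature(sequence):
    """Mean absolute first difference of a sequence (assumed definition)."""
    return float(np.mean(np.abs(np.diff(sequence))))


# Toy data: a pure 50 Hz tone, and a copy with 5 samples deleted mid-signal,
# which introduces a quarter-period phase jump similar to a splice.
fs = 1000
t = np.arange(4 * fs) / fs
clean = np.sin(2.0 * np.pi * 50.0 * t)
tampered = np.concatenate([clean[:2000], clean[2005:]])

for name, audio in (("clean", clean), ("tampered", tampered)):
    _, freq = phase_and_frequency(enf_analytic_signal(audio, fs), fs)
    print(name, shallow_feature(freq))
```

On this noise-free toy signal the tampered copy yields a larger variation feature than the clean one. Real recordings carry noise and natural ENF drift, which is why the abstract combines this shallow statistic with learned deep features instead of thresholding it directly.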

Funder

National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC

Subject

General Medicine

Link

https://link.springer.com/content/pdf/10.1186/s13634-022-00900-4.pdf

References: 28 articles.

1. M.A. Qamhan, H. Altaheri, A.H. Meftah, G. Muhammad, Y.A. Alotaibi, Digital audio forensics: Microphone and environment classification using deep learning. IEEE Access 9, 62719–62733 (2021). https://doi.org/10.1109/access.2021.3073786

2. C. Zeng, D. Zhu, Z. Wang, Z. Wang, N. Zhao, L. He, An end-to-end deep source recording device identification system for web media forensics. Int. J. Web Inf. Syst. 16(4), 413–425 (2020). https://doi.org/10.1108/ijwis-06-2020-0038

3. G. Hua, H. Liao, Q. Wang, H. Zhang, D. Ye, Detection of electric network frequency in audio recordings—from theory to practical detectors. IEEE Trans. Inf. Forensics Secur. 16, 236–248 (2021). https://doi.org/10.1109/tifs.2020.3009579