Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms

Published:2023-04-06 Issue:4 Volume:25 Page:626
ISSN:1099-4300
Container-title:Entropy
language:en
Short-container-title:Entropy

Author:

Zeng Chunyan¹, Feng Shixiong¹, Zhu Dongliang², Wang Zhifeng³

Affiliation:

1. Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China

2. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China

3. Department of Digital Media Technology, Central China Normal University, Wuhan 430079, China

Abstract

Source acquisition device identification from recorded audio aims to identify the recording device by analyzing the intrinsic characteristics of the audio, a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-attention mechanisms to tackle this problem. In the deep feature extraction stage, a two-branch network is constructed from residual dense temporal convolution networks (RD-TCNs) and convolutional neural networks (CNNs). The spatial probability distribution features of the audio signals are the inputs to the CNN branch for spatial representation learning, and the temporal spectral features of the audio signals are fed into the RD-TCN branch for temporal representation learning. The two branches together learn long-term and short-term features simultaneously, yielding an accurate representation of device-related information. In the spatiotemporal feature fusion stage, three attention mechanisms (temporal, spatial, and branch) are designed to capture spatiotemporal weights and achieve effective deep feature fusion. The proposed framework achieves state-of-the-art performance on the benchmark CCNU_Mobile dataset, reaching an accuracy of 97.6% for the identification of 45 recording devices, with a significant reduction in training time compared to other models.
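
The abstract does not spell out how the branch attention is computed, so the following is only a minimal sketch of one common way to fuse two branch embeddings, not the authors' formulation. It assumes each branch has already produced a fixed-length embedding, scores each branch by a dot product with a hypothetical learned vector (`score_weights`), and takes a softmax over the two scores to weight the branches. The function names and the toy numbers are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_attention_fuse(temporal_feat, spatial_feat, score_weights):
    """Fuse two equal-length branch embeddings with softmax branch weights.

    score_weights is a hypothetical learned vector (an assumption of this
    sketch): each branch is scored by its dot product with it. This is not
    the paper's exact branch attention.
    """
    if not (len(temporal_feat) == len(spatial_feat) == len(score_weights)):
        raise ValueError("all vectors must have the same length")
    branches = [temporal_feat, spatial_feat]
    # One scalar score per branch.
    scores = [sum(w * x for w, x in zip(score_weights, b)) for b in branches]
    # Normalize the scores so the branch weights sum to 1.
    weights = softmax(scores)
    # Weighted sum of the branch embeddings, element by element.
    fused = [
        sum(wt * b[i] for wt, b in zip(weights, branches))
        for i in range(len(score_weights))
    ]
    return fused, weights


# Toy embeddings standing in for the RD-TCN (temporal) and CNN (spatial) outputs.
temporal = [1.0, 0.0, 2.0]
spatial = [0.0, 1.0, 0.0]
fused, weights = branch_attention_fuse(temporal, spatial, [0.5, 0.5, 0.5])
print(weights, fused)
```

In this toy case the temporal branch scores higher, so it receives the larger weight. In a trained model the scoring vector would be learned jointly with both branches.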

Publisher

MDPI AG

Subject

General Physics and Astronomy

Link

https://www.mdpi.com/1099-4300/25/4/626/pdf

References: 37 articles.

1. An end-to-end deep source recording device identification system for Web media forensics;Zeng;Int. J. Web Inf. Syst.,2020

2. Audio forensic examination;Maher;IEEE Signal Process. Mag.,2009

3. Shallow and Deep Feature Fusion for Digital Audio Tampering Detection;Wang;EURASIP J. Adv. Signal Process.,2022

4. Audio Tampering Forensics Based on Representation Learning of ENF Phase Sequence;Zeng;Int. J. Digit. Crime Forensics,2022

5. Band Energy Difference for Source Attribution in Audio Forensics;Luo;IEEE Trans. Inf. Forensics Secur.,2018

Cited by 6 articles.

1. Squeeze-and-Excitation Self-Attention Mechanism Enhanced Digital Audio Source Recognition Based on Transfer Learning;Circuits, Systems, and Signal Processing;2024-09-13

2. ENFformer: Long-short term representation of electric network frequency for digital audio tampering detection;Knowledge-Based Systems;2024-08

3. Discriminative Component Analysis Enhanced Feature Fusion of Electrical Network Frequency for Digital Audio Tampering Detection;Circuits, Systems, and Signal Processing;2024-07-26

4. Digital audio tampering detection based on spatio-temporal representation learning of electrical network frequency;Multimedia Tools and Applications;2024-03-27

5. Deletion and insertion tampering detection for speech authentication based on fluctuating super vector of electrical network frequency;Speech Communication;2024-03