Audio Anti-Spoofing Based on Audio Feature Fusion
Published: 2023-06-28
Container-title: Algorithms
Volume: 16, Issue: 7, Page: 317
ISSN: 1999-4893
Language: en
Authors:
Zhang Jiachen 1, Tu Guoqing 1, Liu Shubo 2, Cai Zhaohui 2
Affiliations:
1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
2. School of Computer Science, Wuhan University, Wuhan 430072, China
Abstract
The rapid development of speech synthesis technology has significantly improved the naturalness and human-likeness of synthetic speech. As the technical barriers to speech synthesis fall, illegal activities such as fraud and extortion are increasing, posing a significant threat to authentication systems such as automatic speaker verification. In response to constantly evolving synthesis techniques, this paper proposes an end-to-end speech synthesis detection model based on audio feature fusion to improve the accuracy of detecting synthetic speech. The model uses a pre-trained wav2vec2 model to extract features from raw waveforms and an audio feature fusion module for back-end classification. The fusion module improves accuracy by fully exploiting the front-end audio features, fusing information across both timeframes and feature dimensions. Data augmentation is also used to improve the generalization performance of the model. The model is trained on the training and development sets of the logical access (LA) dataset of the international standard ASVspoof 2019 Challenge and is tested on the logical access (LA) and deepfake (DF) evaluation datasets of the ASVspoof 2021 Challenge. The equal error rates (EER) on ASVspoof 2021 LA and ASVspoof 2021 DF are 1.18% and 2.62%, respectively, achieving the best result on the DF dataset.
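The reported metric, equal error rate (EER), is the operating point at which the false-acceptance rate (spoofed speech accepted) equals the false-rejection rate (bona fide speech rejected). A minimal sketch of computing EER from countermeasure scores, assuming higher scores mean "more likely bona fide" (a plain NumPy threshold sweep for illustration, not the official ASVspoof scoring tool):

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal error rate from detection scores (higher = more bona fide).

    Sweeps the decision threshold over all observed scores and returns
    the error rate where FRR and FAR cross (averaged at the closest point).
    """
    scores = np.concatenate([bona_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bona_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)          # ascending candidate thresholds
    labels = labels[order]
    # FRR: fraction of bona fide scores at or below the threshold
    frr = np.cumsum(labels) / len(bona_scores)
    # FAR: fraction of spoof scores strictly above the threshold
    far = 1.0 - np.cumsum(1 - labels) / len(spoof_scores)
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# Perfectly separated scores give an EER of 0.
print(compute_eer(np.array([0.9, 0.8]), np.array([0.1, 0.2])))  # → 0.0
```

An EER of 1.18%, as reported on ASVspoof 2021 LA, means that at the crossing threshold roughly 1.18% of spoofed utterances are accepted and 1.18% of genuine ones rejected.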
Subject:
Computational Mathematics, Computational Theory and Mathematics, Numerical Analysis, Theoretical Computer Science
Cited by: 2 articles.