ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition
Author:
Deng Fei1, Deng Lihong1ORCID, Jiang Peifan1, Zhang Gexiang23, Yang Qiang3
Affiliation:
1. College of Computer Science and Cyber Security (Oxford Brookes College), Chengdu University of Technology, Chengdu 610059, China 2. Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China 3. School of Control Engineering, Chengdu University of Information Engineering, Chengdu 610059, China
Abstract
In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling the long-term contexts and efficiently aggregating the information are two challenges in speaker recognition, and they have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve the performance with these approaches because they also have trouble fully utilizing global information, channel information, and time-frequency information. To address the above issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting a more informative frame-level. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. Extensive comparison experiments were performed on two popular public speaker recognition datasets, Voxceleb and CN-Celeb, with current state-of-the-art speaker recognition systems and achieved the lowest EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, our designed ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, but 35.1% better performance. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods.
Funder
National Natural Science Foundation of China Sichuan Science and Technology Program
Subject
Electrical and Electronic Engineering,Biochemistry,Instrumentation,Atomic and Molecular Physics, and Optics,Analytical Chemistry
Reference46 articles.
1. Leonardis, A., Bischof, H., and Pinz, A. (2006, January 7–13). Probabilistic Linear Discriminant Analysis. Proceedings of the European Conference on Computer Vision 2006, Graz, Austria. Lecture Notes in Computer Science. 2. Front-End Factor Analysis for Speaker Verification;Dehak;IEEE Trans. Audio Speech Lang. Process.,2011 3. Cai, W., Chen, J., and Li, M. (2018, January 26–29). Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2018), Les Sables d’Olonne, France. 4. Variani, E., Lei, X., McDermott, E., Moreno, I.L., and Gonzalez-Dominguez, J. (2014, January 4–9). Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy. 5. He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27–30). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.
Cited by
3 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
|
|