Abstract
Phonetic information is one of the most essential components of a speech signal and plays an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems, since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods apply phonetic information only to speaker embeddings trained frame by frame. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning, and further combines them into c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that all four proposed speaker embeddings outperform the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, NIST SRE 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
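The abstract reports results as relative improvements in EER. As a rough illustration of what that metric measures, the sketch below estimates the EER from two sets of trial scores and computes a relative improvement between two systems. It is a minimal sketch, not the paper's evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the threshold sweep is a simple approximation, and all scores and EER values are made up for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER: the point where false-accept and false-reject rates meet."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False reject: a target trial scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False accept: a non-target trial scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def relative_improvement(baseline_eer, proposed_eer):
    """Fractional EER reduction of the proposed system relative to the baseline."""
    return (baseline_eer - proposed_eer) / baseline_eer

# Illustrative scores only (higher means "more likely the same speaker").
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
# Illustrative EER values only, not results from the paper.
gain = relative_improvement(0.05, 0.035)
```

With these made-up numbers, a drop in EER from 5% to 3.5% corresponds to a 30% relative improvement, which is the sense in which the abstract's "over 30%" figure should be read.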
Publisher
Springer Science and Business Media LLC
Subject
Electrical and Electronic Engineering, Acoustics and Ultrasonics
Cited by
19 articles.