Introducing phonetic information to speaker embedding for speaker verification

Author:

Liu Yi, He Liang, Liu Jia, Johnson Michael T.

Abstract

Phonetic information is one of the most essential components of a speech signal and plays an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems, since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering, Acoustics and Ultrasonics


Cited by 19 articles.

1. A Word-axis Speaker Embedding Trained with Multi-Speaker Analysis Task;2024 Fifteenth International Conference on Ubiquitous and Future Networks (ICUFN);2024-07-02

2. Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14

3. Deep speaker embeddings for Speaker Verification: Review and experimental comparison;Engineering Applications of Artificial Intelligence;2024-01

4. Robust End-to-End Diarization with Domain Adaptive Training and Multi-Task Learning;2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU);2023-12-16

5. ADAPT-TTS: HIGH-QUALITY ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH ADAPTIVE-BASED FOR VIETNAMESE;Journal of Computer Science and Cybernetics;2023-06-12
