Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification

Published:2019-08-11 Issue:16 Volume:9 Page:3295
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Mingote Victoria, Miguel Antonio, Ortega Alfonso, Lleida Eduardo

Abstract

In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from the global average pooling of the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model that keeps the temporal structure of each phrase, since phonetic information is relevant to the verification task. Moreover, we can apply a convolutional neural network as a front-end, and, because the alignment process is differentiable, we can train the network to produce a supervector for each utterance that is discriminative for the speaker and the phrase simultaneously. This choice has the advantage that the supervector encodes both phrase and speaker information, providing good performance in text-dependent speaker verification tasks. The verification process is performed using a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to networks of similar size that use global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on RSR2015-Part II. To our knowledge, this system achieves the best published results obtained on this second part.
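The contrast the abstract draws between global average pooling and an alignment-based supervector can be illustrated with a small sketch. This is not the authors' implementation: the `uniform_alignment` function is a hypothetical stand-in for the phonetic phrase alignment model, the convolutional front-end is omitted, and cosine similarity is assumed as the "basic similarity metric", which the abstract does not name. All function names are illustrative.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def global_average_pooling(frames: Sequence[Frame]) -> List[float]:
    """Average all frames over time, giving one D-dimensional embedding."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def uniform_alignment(num_frames: int, num_states: int) -> List[int]:
    """Hypothetical alignment: assign frames to states left to right in
    equal-sized chunks. A real system would use a learned alignment model."""
    return [min(t * num_states // num_frames, num_states - 1)
            for t in range(num_frames)]


def supervector(frames: Sequence[Frame], alignment: Sequence[int],
                num_states: int) -> List[float]:
    """Average the frames assigned to each state, then concatenate the
    per-state means into one vector of size num_states * D."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frame, state in zip(frames, alignment):
        counts[state] += 1
        for d in range(dim):
            sums[state][d] += frame[d]
    out: List[float] = []
    for s in range(num_states):
        n = max(counts[s], 1)  # avoid division by zero for empty states
        out.extend(v / n for v in sums[s])
    return out


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, assumed here as the verification metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two toy "utterances" with the same frames in opposite temporal order.
utt_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
utt_b = list(reversed(utt_a))
states = 2

pooled_score = cosine_similarity(global_average_pooling(utt_a),
                                 global_average_pooling(utt_b))
super_score = cosine_similarity(
    supervector(utt_a, uniform_alignment(len(utt_a), states), states),
    supervector(utt_b, uniform_alignment(len(utt_b), states), states))
print(pooled_score, super_score)
```

In this toy case the pooled embeddings of the two utterances are identical, so their similarity is close to 1, while the supervectors differ because each state keeps its own average. That is the sense in which a supervector retains the temporal structure that pooling discards.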

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/9/16/3295/pdf

Reference 29 articles.

1. DeepFace: Closing the Gap to Human-Level Performance in Face Verification

2. FaceNet: A unified embedding for face recognition and clustering

3. Deep feature for text-dependent speaker verification

Cited by 9 articles.

1. Class token and knowledge distillation for multi-head self-attention speaker verification systems;Digital Signal Processing;2023-03

2. aDCF Loss Function for Deep Metric Learning in End-to-End Text-Dependent Speaker Verification Systems;IEEE/ACM Transactions on Audio, Speech, and Language Processing;2022

3. Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems;Interspeech 2021;2021-08-30

4. Memory Layers with Multi-Head Attention Mechanisms for Text-Dependent Speaker Verification;ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2021-06-06

5. Training Speaker Enrollment Models by Network Optimization;Interspeech 2020;2020-10-25