Comparison of Modern Deep Learning Models for Speaker Verification
Published: 2024-02-06
Volume: 14
Issue: 4
Page: 1329
ISSN: 2076-3417
Container-title: Applied Sciences
Short-container-title: Applied Sciences
Language: en
Author:
Brydinskyi Vitalii 1,2; Khoma Yuriy 1,2; Sabodashko Dmytro 1; Podpora Michal 3; Khoma Volodymyr 1,4; Konovalov Alexander 2; Kostiak Maryna 1
Affiliations:
1. Institute of Computer Technologies, Automation and Metrology, Lviv Polytechnic National University, Bandery 12, 79013 Lviv, Ukraine
2. Vidby AG, Suurstoffi 8, 6343 Risch-Rotkreuz, Switzerland
3. Department of Computer Science, Opole University of Technology, Proszkowska 76, 45-758 Opole, Poland
4. Department of Control Engineering, Opole University of Technology, Proszkowska 76, 45-758 Opole, Poland
Abstract
This research presents an extensive comparative analysis of popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied to speaker verification tasks. The study employs a specially curated dataset designed to mirror the real-world operating conditions of voice models as closely as possible. It contains short, non-English utterances gathered from interviews on a popular online video platform, covering 50 unique speakers (33 male, 17 female) aged from 20 to 70 years. This diversity supports a thorough evaluation of speaker verification models, and the dataset is particularly well suited to research on verification from short recordings: it comprises 10 clips per speaker, each no longer than 10 s, for 500 recordings in total. The combined length of all recordings is about 1 h 30 min, averaging roughly 100 s per speaker. The performance of the models is evaluated using common biometric metrics: false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER), and detection cost function (DCF). The results reveal that the TitaNet and ECAPA models stand out with the lowest EERs (1.91% and 1.71%, respectively), exhibiting more discriminative embeddings that minimize intra-class distance (the same speaker) while maximizing the distance between embeddings of different speakers. The analysis also highlights the ECAPA model's advantageous balance of performance and efficiency, achieving an inference time of 69.43 ms, only slightly longer than that of the PyAnnote models.
This study not only compares the performance of the models but also provides a comparative analysis of their embeddings, offering insights into their respective strengths and weaknesses. The presented findings serve as a foundation for guiding future research in speaker verification, especially in the context of short audio samples or limited data, and may be particularly relevant for applications requiring fast and accurate speaker identification from short voice clips.
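As context for the metrics used above, the relationship between FAR, FRR, and EER can be illustrated with a minimal sketch (not code from the paper): given similarity scores for genuine (same-speaker) and impostor (different-speaker) trials, the EER is the error rate at the decision threshold where FAR and FRR coincide. The score distributions below are synthetic, purely for illustration.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the equal error rate (EER) by sweeping a decision
    threshold over all observed similarity scores.

    FAR = fraction of impostor scores accepted (score >= threshold)
    FRR = fraction of genuine scores rejected (score <  threshold)
    EER = error rate where FAR and FRR are (approximately) equal.
    """
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest FAR/FRR crossing point
    return (far[idx] + frr[idx]) / 2.0

# Toy example: genuine pairs score higher than impostor pairs on average.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)   # synthetic same-speaker scores
impostor = rng.normal(0.3, 0.1, 1000)  # synthetic different-speaker scores
eer = compute_eer(genuine, impostor)
```

With well-separated score distributions the EER is close to zero; the 1.71–1.91% EERs reported for ECAPA and TitaNet correspond to a small residual overlap between genuine and impostor score distributions.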
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science