Attention-Based Temporal-Frequency Aggregation for Speaker Verification

Published: 2022-03-10 Issue: 6 Volume: 22 Page: 2147
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Wang Meng, Feng Dazheng, Su Tingting, Chen Mohan

Abstract

Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods aggregate the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that provide more information for the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining it with existing temporal aggregation methods. These two methods capture the speaker-dependent information contained in both the time domain and the frequency domain of frame-level features, thus improving the discriminability of the speaker embedding. Finally, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves an equal error rate of 5.96% on VoxCeleb, outperforming the state-of-the-art baseline models.
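The idea of aggregating over frequency with attention and then over time can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the scoring function or the temporal pooling used, so the linear scoring vector `w`, the function names, the tensor layout (channels, frequency bands, frames), and the use of plain temporal averaging are all assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_frequency_pool(feats, w):
    """Collapse the frequency axis with attention weights.

    feats: array of shape (C, F, T) = (channels, frequency bands, frames).
    w:     hypothetical scoring vector of shape (C,); in a trained model
           this would be learned, here it is just an input.
    Returns an array of shape (C, T).
    """
    # One scalar score per (frequency band, frame) position: shape (F, T).
    scores = np.tensordot(w, feats, axes=([0], [0]))
    # Normalize the scores over frequency bands, separately for each frame.
    alpha = softmax(scores, axis=0)
    # Weighted sum over the frequency axis.
    return (feats * alpha[None, :, :]).sum(axis=1)

def temporal_average_pool(frames):
    """Assumed temporal aggregation: mean over frames, (C, T) -> (C,)."""
    return frames.mean(axis=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 5, 20))   # toy CNN output: C=8, F=5, T=20
w = rng.standard_normal(8)                # stand-in for learned parameters

frames = attentive_frequency_pool(feats, w)   # frequency aggregation
embedding = temporal_average_pool(frames)     # temporal aggregation
```

With an all-zero scoring vector the attention weights become uniform, so the frequency step reduces to a plain mean over frequency bands; a learned `w` would instead emphasize the bands that score highest.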

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/6/2147/pdf
