Affiliation:
1. Faculty of Information Technology, Beijing University of Technology, Xidawang Road, Beijing, China
Abstract
Speaker diarization is the task of automatically distinguishing speakers within an audio recording, without any prior information about the speakers. The introduction of the self-attention mechanism in End-to-End Neural Speaker Diarization (EEND) elegantly resolved the problem of overlapping speakers. The Transformer model, equipped with self-attention, excels at capturing global information and has achieved remarkable results across many tasks. However, individual speaker characteristics are predominantly reflected in local contextual information, which conventional self-attention does not adequately capture. In this study, we propose a hierarchical encoder model that strengthens the encoders’ acquisition of speaker information in two ways: (1) constraining the receptive field of the self-attention mechanism with left-right windows or Gaussian weights to emphasize contextual information; (2) employing a pre-trained time-delay neural network (TDNN) speaker embedding extractor to compensate for the model’s limited speaker feature extraction ability. We evaluate the proposed methods on a simulated two-speaker dataset and a real conversation dataset. The best-performing variant achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation. These results demonstrate the efficacy of the proposed methods.
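The two locality constraints described in the abstract can be illustrated with a minimal NumPy sketch of single-head attention scores. This is not the paper's implementation; the function name `local_attention_scores` and its `window`/`sigma` parameters are assumptions chosen for illustration. A hard left-right window masks out keys more than `window` frames away, while the Gaussian variant instead penalizes distant frames with a bias of −(i−j)²/(2σ²) on the attention logits.

```python
import numpy as np

def local_attention_scores(q, k, window=2, sigma=None):
    """Scaled dot-product attention weights with a locality constraint.

    q, k:   (T, d) query and key matrices for one head.
    window: keep only keys within `window` frames of each query (hard mask).
    sigma:  if given, apply a Gaussian distance penalty instead of the mask.
    Returns a (T, T) row-stochastic attention weight matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (T, T) raw attention logits
    idx = np.arange(q.shape[0])
    dist = np.abs(idx[:, None] - idx[None, :])     # frame distance |i - j|
    if sigma is not None:
        # Soft locality: Gaussian bias decays attention with distance.
        scores = scores - dist.astype(float) ** 2 / (2.0 * sigma ** 2)
    else:
        # Hard locality: forbid attention outside the left-right window.
        scores = np.where(dist <= window, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

With the hard window, weights outside the window are exactly zero; with the Gaussian bias, all frames remain reachable but distant ones are strongly down-weighted, which matches the abstract's goal of highlighting contextual information without discarding global attention entirely.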
Subject
Artificial Intelligence, General Engineering, Statistics and Probability