Abstract
There has been little work in the literature on speaker diarization of meetings with multiple distant microphones since the 2012 publications related to the last National Institute of Standards and Technology (NIST) Rich Transcription Evaluation Campaign, held in 2009 (RT09). More recently, the Second DIHARD Challenge Evaluation has also covered diarization of dinner party meetings recorded with multiple distant microphones. Dinner party meetings are somewhat harder than office meetings because participants can move freely around the room. In this paper, we study some of the speaker diarization algorithms for meetings with multiple distant microphones developed for the NIST Rich Transcription Evaluation Campaigns of 2007 (RT07) and 2009 (RT09), and we provide clear improvements. On the one hand, little attention has been paid to the problem of penalizing or favoring transitions between speakers, other than imposing a minimum duration on a speaker turn or calculating the speakers' probabilities using Variational Bayes (VB). We have studied this issue and determined that a transition penalty term is needed, and that it should be independent of both the number of active speakers and the minimum duration of speaker turns. On the other hand, a method that automatically selects the right number of parameters is crucial for developing good speaker models. Previous studies have proposed selecting the number of parameters dynamically based on the duration of each speaker's speech, with mixed performance when tested on meetings with a single distant microphone or with multiple distant microphones.
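To make the idea of a transition penalty concrete, the following is a minimal sketch of Viterbi decoding over per-frame speaker log-likelihoods in which a single constant is subtracted whenever the path changes speaker. It is an illustration under stated assumptions, not the system described in this paper: the function name `viterbi_with_switch_penalty`, the input layout, and the toy scores are all hypothetical, and the sketch omits the minimum-duration constraint and the speaker models themselves. The only point it shows is that the penalty is one fixed value that does not depend on how many speakers are active.

```python
def viterbi_with_switch_penalty(log_lik, penalty):
    """Best speaker sequence given per-frame log-likelihoods.

    log_lik: list of frames, each a list with one log-likelihood per speaker
             (hypothetical input layout, for illustration only).
    penalty: non-negative constant subtracted on every speaker change;
             it does not depend on the number of speakers.
    """
    num_speakers = len(log_lik[0])
    score = list(log_lik[0])      # best path score ending in each speaker
    back = []                     # back-pointers for frames 1..T-1

    for frame in log_lik[1:]:
        new_score, pointers = [], []
        for j in range(num_speakers):
            # Score of arriving at speaker j from each previous speaker i.
            candidates = [
                score[i] - (penalty if i != j else 0.0)
                for i in range(num_speakers)
            ]
            best_i = max(range(num_speakers), key=candidates.__getitem__)
            pointers.append(best_i)
            new_score.append(candidates[best_i] + frame[j])
        score = new_score
        back.append(pointers)

    # Backtrack from the best final speaker.
    path = [max(range(num_speakers), key=score.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores (made up): speaker 0 dominates except for a one-frame blip.
toy = [[0.0, -1.0], [0.0, -1.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
no_penalty = viterbi_with_switch_penalty(toy, penalty=0.0)
with_penalty = viterbi_with_switch_penalty(toy, penalty=2.0)
```

With no penalty the decoded path follows the one-frame blip and switches speaker twice; with the penalty the blip is absorbed into the surrounding speaker turn.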
In this paper, we propose a new method that uses the duration of each speaker's speech to determine a minimum number of parameters and the risk of overfitting to determine a maximum number, while also taking computation time into account in order to reduce it. We have carried out experiments to support our findings, and we have been able to improve our baseline speaker error rate on meetings with multiple distant microphones. Both methods improve on the baseline. The first method obtains a 21.6% relative decrease in speaker error on the development set and a 4.6% relative decrease on the test set (RT09). The second method obtains a 46.47% relative decrease in speaker error on the development set and a 17.54% relative decrease on the test set. The two methods complement each other: applied in combination, they yield a 47.2% relative decrease in speaker error on the development set and a 22.02% relative decrease on the test set. The performance obtained with our proposal is outstanding on some subsets of the development set, such as NIST RT07, and among the best for RT09, using only the simple modifications we propose. Furthermore, our algorithm reduces computation time without jeopardizing performance. Results on a different publicly available database, Augmented Multiparty Interaction (AMI), show a 28.44% relative decrease in speaker error, confirming the validity of our methods. Preliminary experiments with a single feature stream (MFCC) support our findings. In comparisons with an x-vector system, our system delivers superior performance on unseen test data.
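The improvements above are reported as relative decreases in speaker error. As a reading aid, the sketch below shows how such a figure is conventionally computed from a baseline error rate and a new error rate. The helper name `relative_decrease` and the numbers in the usage line are hypothetical placeholders chosen for round arithmetic; they are not results from this paper.

```python
def relative_decrease(baseline_error, new_error):
    """Relative decrease in error, in percent of the baseline error.

    Both arguments are error rates in the same unit (for example,
    percent speaker error). The baseline must be positive.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical example: a baseline speaker error of 10.0% reduced to 8.0%
# is a 2.0-point absolute decrease and a 20% relative decrease.
example = relative_decrease(10.0, 8.0)
```

Note that a relative decrease is not the same as an absolute decrease in percentage points, so the figures quoted in the abstract cannot be converted to absolute error rates without the baseline values.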
Funder
Ministerio de Economia y Empresa
Publisher
Springer Science and Business Media LLC
Subject
Electrical and Electronic Engineering, Acoustics and Ultrasonics