Heterogeneous separation consistency training for adaptation of unsupervised speech separation-Reference-Cited by-同舟云学术

Heterogeneous separation consistency training for adaptation of unsupervised speech separation

Published:2023-01-20 Issue:1 Volume:2023 Page:
ISSN:1687-4722
Container-title:EURASIP Journal on Audio, Speech, and Music Processing
language:en
Short-container-title:J AUDIO SPEECH MUSIC PROC.

Author:

Han Jiangyu,Long Yanhua^ORCID

Abstract

AbstractRecently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic, because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from the ones in simulated datasets. Therefore, the performance usually degrades significantly when applying the supervised speech separation models to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit the real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging upon the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework using two heterogeneous neural networks (HNNs) to produce high confidence pseudo labels of unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs to produce more reliable consistent separation results for real mixture pseudo-labeling. To maximally utilize the large complementary information between different separation networks, a cross-knowledge adaptation is further proposed. Together with simulated dataset, those real mixtures with high confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combing the heterogeneous separation outputs by a simple linear fusion can further slightly improve the final system performance. In this paper, we use cross-dataset to simulate the cross-domain situation in real-life. The term of “source domain” and “target domain” refer to the simulation set for model pre-training and the real unlabeled mixture for model adaptation. The proposed SCT is evaluated on both public reverberant English and anechoic Mandarin cross-domain separation tasks. Results show that, without any available ground-truth of target domain mixtures, the SCT can still significantly outperform our two strong baselines with up to 1.61 dB and 3.44 dB scale-invariant signal-to-noise ratio (SI-SNR) improvements, on the English and Mandarin cross-domain conditions, respectively.

Funder

National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering,Acoustics and Ultrasonics

Link

https://link.springer.com/content/pdf/10.1186/s13636-023-00273-y.pdf

Reference56 articles.

1. J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Deep clustering: Discriminative embeddings for segmentation and separation (IEEE, 2016), pp. 31–35

2. D. Yu, M. Kolbæk, Z.H. Tan, J. Jensen, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Permutation invariant training of deep models for speaker-independent multi-talker speech separation (IEEE, 2017), pp. 241–245

3. M. Kolbæk, D. Yu, Z.H. Tan, J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(10), 1901–1913 (2017)

4. Z. Chen, Y. Luo, N. Mesgarani, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Deep attractor network for single-microphone speaker separation (IEEE, 2017), pp. 246–250

5. Y. Luo, Z. Chen, N. Mesgarani, Speaker-independent speech separation with deep attractor network. IEEE/ACM Trans. Audio Speech Lang. Process. 26(4), 787–796 (2018)

Cited by 1 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Generative Adversarial Networks for Heterogeneous Unsupervised Domain Adaptation Detection;2024 IEEE 28th International Conference on Intelligent Engineering Systems (INES);2024-07-17