Iteratively Refined Multi-Channel Speech Separation

Published: 2024-07-22 Issue: 14 Volume: 14 Page: 6375
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Zhang Xu¹, Bao Changchun¹, Yang Xue¹, Zhou Jing¹

Affiliation:

1. Institute of Speech and Audio Information Processing, School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China

Abstract

Combining neural networks with beamforming has proven very effective for multi-channel speech separation, but its performance remains challenged in complex environments. This paper proposes an iteratively refined multi-channel speech separation method to meet this challenge. The method consists of two stages: initial separation and iterative separation. In the initial separation, a time–frequency domain dual-path recurrent neural network (TFDPRNN), a minimum variance distortionless response (MVDR) beamformer, and a post-separation network are cascaded to produce the first additional input for the iterative separation stage. In the iterative separation, the MVDR beamformer and the post-separation network are applied repeatedly: the output of the MVDR beamformer serves as an additional input to the post-separation network, and the final output is taken from the post-separation network. Alternating between the beamformer and post-separation in this way lets each improve the other, which ultimately improves overall performance. Experiments on the spatialized version of the WSJ0-2mix corpus show that the proposed method achieves a signal-to-distortion ratio (SDR) improvement of 24.17 dB, significantly better than current popular methods. The method also achieves an SDR of 20.2 dB on joint separation and dereverberation tasks. These results indicate the method's effectiveness and significance in the multi-channel speech separation field.
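
The control flow described in the abstract (initial network, MVDR, post-separation, then repeated MVDR and post-separation) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the beamformer statistics are estimated, so the sketch assumes each network returns per-channel source estimates and uses the Souden-style MVDR formulation. The callables `initial_sep` and `post_sep`, and the names `n_iters` and `ref`, are hypothetical placeholders standing in for the TFDPRNN and the post-separation network.

```python
import numpy as np

def spatial_covariance(X):
    """X: complex STFT of shape (channels, frames, freqs).
    Returns per-frequency covariance matrices of shape (freqs, channels, channels)."""
    return np.einsum("ctf,dtf->fcd", X, X.conj()) / X.shape[1]

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-6):
    """Souden-style MVDR (an assumption, not taken from the abstract):
    w = (phi_n^-1 phi_s) u_ref / trace(phi_n^-1 phi_s)."""
    C = phi_n.shape[-1]
    # Diagonal loading keeps phi_n invertible when it is close to singular.
    load = eps * np.trace(phi_n, axis1=-2, axis2=-1).real[:, None, None] / C
    phi_n = phi_n + (load + 1e-10) * np.eye(C)
    num = np.linalg.solve(phi_n, phi_s)               # (freqs, C, C)
    tr = np.trace(num, axis1=-2, axis2=-1)[:, None]   # (freqs, 1)
    return num[..., ref] / (tr + 1e-10)               # (freqs, C)

def apply_beamformer(w, X):
    """y[t, f] = w[f]^H x[t, f] for X of shape (channels, frames, freqs)."""
    return np.einsum("fc,ctf->tf", w.conj(), X)

def beamform_all(mix, est, ref=0):
    """est: per-channel source estimates, shape (sources, channels, frames, freqs).
    Returns one beamformed signal per source, shape (sources, frames, freqs)."""
    out = []
    for s in range(est.shape[0]):
        phi_s = spatial_covariance(est[s])         # target statistics
        phi_n = spatial_covariance(mix - est[s])   # everything else
        out.append(apply_beamformer(mvdr_weights(phi_s, phi_n, ref), mix))
    return np.stack(out)

def iterative_separation(mix, initial_sep, post_sep, n_iters=2, ref=0):
    """initial_sep(mix) and post_sep(mix, beamformed) are hypothetical callables
    returning estimates of shape (sources, channels, frames, freqs)."""
    est = initial_sep(mix)              # initial separation: first network
    bf = beamform_all(mix, est, ref)    # initial separation: MVDR
    est = post_sep(mix, bf)             # initial separation: post-separation
    for _ in range(n_iters):            # iterative separation
        bf = beamform_all(mix, est, ref)   # beamformer uses the refined estimates
        est = post_sep(mix, bf)            # beamformer output is the extra input
    return est[:, ref]                  # final output comes from post-separation
```

In this sketch each pass recomputes the spatial covariances from the latest post-separation estimates, so better estimates yield a better beamformer, whose output in turn gives the post-separation network a cleaner additional input. The number of passes is left as a free parameter because the abstract does not state it.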

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/14/6375/pdf
