Author:
Nagano Daichi, Nakazawa Kazuo
Abstract
It remains challenging for robots to accurately perform sound source localization and speech recognition in real environments with reverberation, noise, and the voices of multiple speakers. Accordingly, we propose “U-TasNet-Beam,” a speech extraction method that extracts only the target speaker’s voice from the ambient sound in a real environment. U-TasNet-Beam is a neural beamformer comprising three elements: a neural network for removing reverberation and noise, a second neural network for separating the voices of multiple speakers, and a minimum variance distortionless response (MVDR) beamformer. Experiments with simulated and recorded data show that the proposed U-TasNet-Beam improves the accuracy of sound source localization and speech recognition for robots compared with conventional methods in a noisy, reverberant, multi-speaker environment. In addition, we propose the spatial correlation matrix loss (SCM loss) as a loss function for training a neural network to learn the spatial information of the sound. By using the SCM loss, we improve the speech extraction performance of the neural beamformer.
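The abstract does not give the definition of the SCM loss, but a common form compares the spatial covariance (correlation) matrices computed from the multichannel short-time Fourier transform of the estimated and reference signals. The sketch below illustrates one plausible interpretation: per-frequency SCMs obtained by time-averaging outer products of channel vectors, with a Frobenius-norm distance as the loss. The function names and the exact normalization are assumptions for illustration, not the paper’s definition.

```python
import numpy as np

def spatial_correlation_matrix(X):
    """Per-frequency spatial correlation matrix of a multichannel STFT.

    X: complex array of shape (freq_bins, channels, frames).
    Returns an array of shape (freq_bins, channels, channels), where each
    slice is the time-averaged outer product X_f(t) X_f(t)^H.
    """
    _, _, n_frames = X.shape
    return np.einsum('fct,fdt->fcd', X, X.conj()) / n_frames

def scm_loss(scm_est, scm_ref):
    """Frobenius-norm distance between estimated and reference SCMs,
    averaged over frequency (one plausible form of an SCM loss)."""
    diff = scm_est - scm_ref
    return float(np.mean(np.linalg.norm(diff, axis=(1, 2))))
```

Because the SCM captures inter-channel phase and level differences, minimizing such a loss encourages the network to preserve the spatial cues that the downstream MVDR beamformer relies on, rather than only the per-channel magnitudes.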
Publisher
Fuji Technology Press Ltd.
Subject
Electrical and Electronic Engineering, General Computer Science