3 Directional Inception-ResUNet: deep spatial feature learning for multichannel singing voice separation with distortion

Published: 2023-07-24

Author:

Wang DaDong, Wang Jie, Sun MingChen

Abstract

Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that a humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, affected by distortion, attenuation, and reverberation. In this paper, we use a 3 directional Inception-ResNet structure in a U-shaped encoder-decoder network to make better use of the spatial and spectral information in the spectrograms. The model is trained with multiple objectives: magnitude consistency loss, phase consistency loss, and magnitude correlation consistency loss. We recorded the singing voice and accompaniment from the MIR-1K dataset with NAO robots and synthesized 10-channel datasets for training the model. The experimental results show that the proposed model, trained with the multi-objective loss, reaches an average NSDR of 11.55 dB on the test datasets, outperforming the comparison model.

Author summary

Mixtures encountered in real singing voice separation always contain noise and distortion. In this paper, acoustic signals with distortion and noise, as perceived by the robot, are used to study the separation of singing voices in real scenes. The paper describes how to synthesize the training datasets, proposes a 3 directional Inception-ResUNet structure for multichannel singing voice separation, and adopts multiple training objectives, including a magnitude correlation consistency loss. The experimental results show that the magnitude correlation consistency loss reduces distortion and that the proposed model performs better than the compared models.
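For readers unfamiliar with the reported metric: NSDR (normalized signal-to-distortion ratio) is commonly defined as the SDR of the separated estimate minus the SDR obtained when the unprocessed mixture is used as the estimate. The abstract does not say how the authors computed it, so the following is only a minimal sketch under assumptions: it uses a simplified energy-ratio SDR rather than the BSS Eval decomposition usually used in the literature, assumes a non-silent reference, and the function names `sdr_db` and `nsdr_db` and the toy signals are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy.

    Assumption: plain energy ratio, not the BSS Eval projection-based SDR,
    and a reference signal with non-zero energy.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / error_energy)


def nsdr_db(reference, estimate, mixture):
    """NSDR: SDR of the estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): a low-frequency "voice" and an interfering "accompaniment".
n = 1000
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
accomp = [0.5 * math.sin(2 * math.pi * 50 * t / n) for t in range(n)]
mixture = [v + a for v, a in zip(voice, accomp)]

# Hypothetical separator output that leaves 10% of the accompaniment amplitude.
estimate = [v + 0.1 * a for v, a in zip(voice, accomp)]

nsdr = nsdr_db(voice, estimate, mixture)
print(round(nsdr, 2))  # 20.0: a 10x residual amplitude reduction is a 20 dB gain
```

In this toy case the residual interference amplitude drops by a factor of 10, so the error energy drops by a factor of 100 and the NSDR is 20 dB. The reported 11.55 dB average would correspond, under this simplified definition, to roughly a 14-fold reduction in error energy relative to the mixture.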

Publisher

Cold Spring Harbor Laboratory
