Abstract
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) at almost no additional computational cost. It consists of three components: (i) a multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) a combination loss (CL). MDL takes advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths so that the output instruments share their information. MDL is then applied to combinations of the output sources as well as to each individual source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods because they are merely loss functions, used only during training, and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net, and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, demonstrating its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.
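The abstract describes MDL as combining frequency- and time-domain reconstruction terms, and CL as applying the loss to summed combinations of the estimated sources in addition to each source alone. The following is a minimal, hypothetical sketch of these two ideas; the function names, the MSE criterion, the weighting factor `alpha`, and the choice of source subsets are assumptions for illustration only and do not reproduce the authors' implementation (see the linked repository for the official code).

```python
# Hypothetical sketch of a multi-domain loss (MDL) and combination loss (CL).
# All names, shapes, and weights below are illustrative assumptions.
import itertools
import torch


def multi_domain_loss(est_specs, tgt_specs, est_wavs, tgt_wavs, alpha=0.5):
    """Weighted sum of a frequency-domain term (magnitude-spectrogram MSE)
    and a time-domain term (waveform MSE)."""
    freq_term = torch.mean((est_specs - tgt_specs) ** 2)
    time_term = torch.mean((est_wavs - tgt_wavs) ** 2)
    return alpha * freq_term + (1.0 - alpha) * time_term


def combination_loss(est_wavs, tgt_wavs):
    """Apply a loss to sums of source subsets (e.g., vocals+drums) as well as
    to each single source, so that errors which cancel between instruments
    are still penalized. Tensors are assumed to have shape
    (n_src, batch, time); the per-combination criterion here is a plain
    waveform MSE, and the full mixture of all sources is excluded, both of
    which are simplifying assumptions."""
    n_src = est_wavs.shape[0]
    total, count = 0.0, 0
    for k in range(1, n_src):  # subsets of size 1 .. n_src-1
        for subset in itertools.combinations(range(n_src), k):
            idx = list(subset)
            est_mix = est_wavs[idx].sum(dim=0)
            tgt_mix = tgt_wavs[idx].sum(dim=0)
            total = total + torch.mean((est_mix - tgt_mix) ** 2)
            count += 1
    return total / count
```

Because both terms are ordinary differentiable loss functions, they can be added to an existing training loop without modifying the separation network used at inference time, which is the property the abstract emphasizes.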
Publisher
Springer Science and Business Media LLC
Cited by
1 article.
1. Real-Time Low-Latency Music Source Separation Using Hybrid Spectrogram-TasNet. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024-04-14.