Affiliation:
1. School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Abstract
To overcome the limitations of traditional methods in reverberant and noisy environments, a robust multi-scale fusion neural network with an attention mask is designed to improve direction-of-arrival (DOA) estimation accuracy for acoustic sources. It combines the strengths of deep learning with complex-valued operations to counter the interference of reverberation and noise in speech signals. The unique properties of complex-valued signals are exploited to capture their inherent features fully, preserving rich information in the complex domain. An attention mask module generates distinct masks that selectively emphasize or suppress components of the input. The multi-scale fusion block then captures multi-scale spatial features by stacking complex-valued convolutional layers with small kernels, and reduces module complexity through dedicated branching operations. Experimental results demonstrate that the model achieves significant improvements over competing methods for speaker localization in reverberant and noisy environments. It provides a new solution for DOA estimation for acoustic sources in different scenarios, with both theoretical and practical significance.
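The abstract does not specify the network's internals, but the core building blocks it names can be sketched in plain NumPy: a complex-valued convolution decomposed into four real convolutions, a sigmoid attention mask driven by local magnitude, and a stack of small-kernel stages whose outputs are summed to fuse multiple receptive-field scales. All function names, the masking rule, and the fusion-by-summation choice below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def complex_conv1d(x_re, x_im, w_re, w_im):
    # (x_re + j*x_im) * (w_re + j*w_im) expanded into four real convolutions:
    # real part = ac - bd, imaginary part = ad + bc
    y_re = np.convolve(x_re, w_re, "same") - np.convolve(x_im, w_im, "same")
    y_im = np.convolve(x_re, w_im, "same") + np.convolve(x_im, w_re, "same")
    return y_re, y_im

def attention_mask(x_re, x_im):
    # Illustrative mask: a sigmoid of the local magnitude, applied to both
    # components, so low-energy (likely noise-dominated) bins are attenuated
    mask = 1.0 / (1.0 + np.exp(-np.hypot(x_re, x_im)))
    return x_re * mask, x_im * mask

def multiscale_fusion(x_re, x_im, kernel_sizes=(3, 3, 3), rng=None):
    # Stack small-kernel complex convolutions; summing every stage's output
    # fuses features from receptive fields of increasing effective size
    rng = np.random.default_rng(0) if rng is None else rng
    fused_re = np.zeros_like(x_re)
    fused_im = np.zeros_like(x_im)
    for k in kernel_sizes:
        w_re = rng.standard_normal(k)
        w_im = rng.standard_normal(k)
        x_re, x_im = complex_conv1d(x_re, x_im, w_re, w_im)
        fused_re += x_re
        fused_im += x_im
    return fused_re, fused_im
```

Decomposing the complex product into real convolutions is the standard way complex-valued layers are implemented on real-valued hardware; the real network would learn the kernels and likely use 2D convolutions over time-frequency features rather than the random 1D kernels used here for illustration.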