Affiliation:
1. School of Power and Energy, Northwestern Polytechnical University, Xi’an 710072, China
Abstract
Airborne speech enhancement remains a major challenge for the security of airborne systems. Recently, multi-objective learning has become one of the mainstream approaches to monaural speech enhancement. In this paper, we propose a novel multi-objective method for airborne speech enhancement, called the stacked multiscale densely connected temporal convolutional attention network (SMDTANet). More specifically, the core of SMDTANet consists of three parts: a stacked multiscale feature extractor, a triple-attention-based temporal convolutional neural network (TA-TCNN), and a densely connected prediction module. The stacked multiscale feature extractor captures comprehensive feature information from noisy log-power spectra (LPS) inputs. The TA-TCNN then takes a combination of these multiscale features and noisy amplitude modulation spectrogram (AMS) features as input to strengthen its temporal modeling capability. In the TA-TCNN, we integrate the advantages of channel attention, spatial attention, and time-frequency (T-F) attention to design a novel triple-attention module, which guides the network to suppress irrelevant information and emphasize informative features from different views. The densely connected prediction module reliably controls the flow of information to provide accurate estimates of the clean LPS and the ideal ratio mask (IRM). Moreover, a new joint-weighted (JW) loss function is constructed to further improve performance without increasing model complexity. Extensive experiments under real-world airborne conditions show that SMDTANet achieves performance on par with or better than other reference methods on all objective metrics of speech quality and intelligibility.
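To make the triple-attention idea more concrete, the sketch below shows a minimal PyTorch-style block that applies channel, spatial, and T-F attention gates to a (batch, channel, time, frequency) feature map. This is only an illustrative sketch: the module name, reduction ratio, kernel sizes, and the way the three gates are fused are assumptions and do not reproduce the authors' exact SMDTANet implementation.

```python
# Illustrative sketch (not the authors' code): a minimal triple-attention block
# combining channel, spatial, and time-frequency (T-F) attention over a feature
# map of shape (batch, channels, time, freq). All design choices here
# (reduction ratio, kernel sizes, gate fusion) are assumptions for illustration.
import torch
import torch.nn as nn


class TripleAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze the T-F plane, excite per-channel weights.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single-channel gate over the T-F plane.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # T-F attention: separate gates along the time and frequency axes.
        self.time_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=(3, 1), padding=(1, 0)),
            nn.Sigmoid(),
        )
        self.freq_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=(1, 3), padding=(0, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        x = x * self.channel_gate(x)      # re-weight channels
        x = x * self.spatial_gate(x)      # re-weight T-F positions
        # Average the axis-wise gates so both time and frequency views contribute.
        tf = 0.5 * (self.time_gate(x) + self.freq_gate(x))
        return x * tf


if __name__ == "__main__":
    feats = torch.randn(2, 16, 100, 64)   # (batch, channels, frames, freq bins)
    out = TripleAttention(channels=16)(feats)
    print(out.shape)                       # torch.Size([2, 16, 100, 64])
```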