Affiliation:
1. Electronics and Communication Engineering, Acharya Nagarjuna University, Guntur, India
Abstract
The goal of speech enhancement is to restore clean speech from noisy recordings. Acoustic scenarios with low signal-to-noise ratios (SNRs) make it particularly challenging to extract the target speech from background noise. In this study, we propose a feature-recalibration-based multi-scale convolutional encoder-decoder architecture with a squeeze temporal convolutional network (S-TCN) bottleneck for enhancing noisy speech. Each multi-scale convolutional layer in the encoder and decoder is followed by a time-frequency attention (TFA) module. The recalibration-based multi-scale 2D convolution layers extract local and contextual information. In addition, the recalibration network is equipped with a gating mechanism that controls the flow of information among the layers, weighting the scaled features so as to suppress noise and retain speech. The fully connected (FC) layer in the bottleneck of the encoder-decoder contains few neurons; it captures global information from the multi-scale 2D convolution layers while reducing the parameter count. An S-TCN, inspired by the popular temporal convolutional neural network (TCNN), is inserted between the encoder and the decoder to model long-term dependencies in speech. The TFA is a highly efficient network component that applies two attention branches in parallel, one over time frames and the other over frequency channels. Together, these branches explicitly exploit positional information to build a two-dimensional attention map that captures the significant time-frequency distribution of speech. On the Common Voice dataset, the proposed model consistently improves over current benchmarks, as measured by two widely used objective metrics, PESQ and STOI.
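The two-branch attention described above can be illustrated with a minimal sketch. This is not the paper's implementation: the actual TFA module produces its attention vectors with small learned layers, whereas this toy version substitutes simple mean pooling followed by a sigmoid, purely to show how the outer product of a per-frame and a per-bin attention vector yields a 2D time-frequency map that rescales the feature map.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tfa_sketch(x):
    """Toy time-frequency attention over a feature map.

    x: array of shape (channels, time, freq).
    The time branch pools over frequency to get one weight per frame;
    the frequency branch pools over time to get one weight per bin.
    Broadcasting their product gives a full (time, freq) attention map
    per channel, which gates the input elementwise.
    """
    # time branch: average over frequency axis -> (C, T, 1)
    t_att = sigmoid(x.mean(axis=2, keepdims=True))
    # frequency branch: average over time axis -> (C, 1, F)
    f_att = sigmoid(x.mean(axis=1, keepdims=True))
    # outer product via broadcasting -> 2D attention map (C, T, F)
    att = t_att * f_att
    return x * att


# usage: a random "spectrogram-like" feature map keeps its shape,
# and every element is attenuated by a weight in (0, 1)
features = np.random.randn(2, 10, 16)
enhanced = tfa_sketch(features)
```

In the paper's formulation these attention vectors are learned, so the network can emphasize speech-dominated frames and frequency bins rather than merely reweighting by energy as the pooling here does.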
The proposed model yields significant improvements, with average PESQ and STOI scores increasing by 45.7% and 23.8%, respectively, for seen background noises, and by 43.5% and 21.4% for unseen background noises, relative to the noisy speech. Experiments confirm that the proposed approach outperforms numerous state-of-the-art algorithms.