A New U-Net Speech Enhancement Framework Based on Correlation Characteristics of Speech

Published: 2024-04-09
ISSN: 0148-7191
Container-title: SAE Technical Paper Series

Author:

Zhang Lijun¹, Pei Kaikun¹, Li Wenbo¹, Meng Dejian¹, He Yinzhi¹

Affiliation:

1. Tongji University

Abstract

As a key component of in-vehicle intelligent voice technology, speech enhancement extracts clean speech from signals contaminated by environmental noise, improving perceptual quality and intelligibility. It has extensive applications in intelligent car cabins. Although several end-to-end time-domain speech enhancement methods have been proposed, their model architectures are rarely designed around the characteristics of the speech signal. In this paper, we propose a new U-Net-based speech enhancement framework that exploits the temporal correlation of speech signals to reconstruct higher-quality, more intelligible clean speech. First, to address the inadequate extraction of multi-scale correlation features during feature extraction and reconstruction, we devise a densely connected multi-scale feature extraction module based on gated dilated convolution, which extends the temporal receptive field and extracts features at diverse scales. Second, to tackle feature loss and harmonic distortion during sampling, we propose a fine-grained pooling-reconstruction sampling method based on feature map recombination. This method aims to minimize information loss during down-sampling while enhancing the clarity of reconstructed waveforms during up-sampling. Lastly, building on this pooling-reconstruction sampling method, we propose a deep supervision approach for multi-scale features, which enables effective supervision of perceptual characteristics across different frequency ranges. To validate the proposed framework, experiments were conducted on the Voicebank+Demand dataset. The results show that, compared with other advanced algorithms, the proposed model significantly improves metrics such as PESQ, STOI, CSIG, CBAK, and COVL. Even in low-SNR environments, the enhanced speech signals exhibit noticeable improvements in quality and intelligibility, which benefits subsequent automotive voice applications.
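
The abstract names two building blocks without giving their details: gated dilated convolution and sampling by feature map recombination. The NumPy sketch below is only one plausible reading of those terms, not the authors' implementation. It assumes tanh/sigmoid gating, odd kernel sizes, and a 1-D rearrangement of time samples into channels; all function names (`gated_dilated_conv1d`, `receptive_field`, `recombine_down`, `recombine_up`) are hypothetical.

```python
import numpy as np


def gated_dilated_conv1d(x, w_filter, w_gate, dilation=1):
    """'Same'-padded gated dilated convolution on a 1-D signal.

    The gating used here, tanh(filter) * sigmoid(gate), is an assumed,
    commonly used form; the paper's exact gating is not given in the abstract.
    """
    k = len(w_filter)
    assert k % 2 == 1 and len(w_gate) == k, "odd, equal-length kernels expected"
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    # Row t holds the k dilated taps centred on output sample t.
    idx = np.arange(len(x))[:, None] + dilation * np.arange(k)[None, :]
    taps = xp[idx]
    feature = np.tanh(taps @ w_filter)
    gate = 1.0 / (1.0 + np.exp(-(taps @ w_gate)))
    return feature * gate


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


def recombine_down(x, r):
    """Down-sample (C, T) -> (C*r, T//r) by moving time samples into channels.

    No sample is discarded, unlike max pooling or strided convolution.
    """
    c, t = x.shape
    assert t % r == 0, "length must be divisible by the factor r"
    return x.reshape(c, t // r, r).transpose(0, 2, 1).reshape(c * r, t // r)


def recombine_up(y, r):
    """Inverse of recombine_down: (C*r, T') -> (C, T'*r)."""
    cr, t = y.shape
    assert cr % r == 0, "channel count must be divisible by the factor r"
    return y.reshape(cr // r, r, t).transpose(0, 2, 1).reshape(cr // r, t * r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16)
    out = gated_dilated_conv1d(
        signal, rng.standard_normal(3), rng.standard_normal(3), dilation=4
    )
    print(out.shape)                      # same length as the input
    print(receptive_field(3, [1, 2, 4, 8]))

    feats = np.arange(12.0).reshape(2, 6)
    down = recombine_down(feats, 2)
    print(np.array_equal(recombine_up(down, 2), feats))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is the usual motivation for dilated stacks. The recombination pair is exactly invertible, which illustrates how a sampling step can avoid discarding information, although the paper's actual pooling-reconstruction method may differ.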

Publisher

SAE International

References: 25 articles.

1. Fei, M., Zhu, X., and Li, Y., "Application and Development of AI Technology in Automobile Intelligent Cockpit," 2022 3rd International Conference on Electronic Communication and Artificial Intelligence (IWECAI), 2022, 274-280, doi:10.1109/iwecai55315.2022.00059.

2. Honda, Y., Kawamura, A., and Iiguni, Y., "Car Noise Suppression Using Adaptive Noise Canceler with Speech Suppressors," Electronics and Communications in Japan 100, no. 12 (2017): 14-28, doi:10.1002/ECJ.11997.

3. Weng, F., Angkititrakul, P., Shriberg, E. et al., "Conversational In-Vehicle Dialog Systems: The past, present, and future," IEEE Signal Processing Magazine 33, no. 6 (2016): 49-60, doi:10.1109/MSP.2016.2599201.

4. LeCun, Y., Bengio, Y., and Hinton, G.E., "Deep Learning," Nature 521, no. 7553 (2015): 436-444, doi:10.1038/nature14539.

5. Xu, Y., Du, J., Dai, L., and Lee, C.-H., "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, no. 1 (2015): 7-19, doi:10.1109/TASLP.2014.2364452.