Dual-path transformer-based network with equalization-generation components prediction for flexible vibrational sensor speech enhancement in the time domain
Published:2022-05
Issue:5
Volume:151
Page:2814-2825
ISSN:0001-4966
Container-title:The Journal of the Acoustical Society of America
language:en
Short-container-title:The Journal of the Acoustical Society of America
Author:
Zheng Changyan1, Xu Liguo1, Fan Xiaohu1, Yang Jibin2, Fan Junyi2, Huang Xian3
Affiliation:
1. High-tech Institute, Fan Gong-ting South Street on the 12th, Weifang 261000, China
2. Command and Control Engineering College, Army Engineering University, Nanjing 210007, China
3. Department of Biomedical Engineering, Tianjin University, Tianjin 300072, China
Abstract
The flexible vibrational sensor (FVS) has the potential to become a popular wearable communication device because of its natural noise-shielding characteristics and soft materials. However, FVS speech suffers a severe loss of frequency components. To improve speech quality, a time-domain neural network model based on the dual-path transformer combined with equalization-generation components prediction (DPT-EGNet) is proposed. The DPT-EGNet consists of five modules: a pre-processing module, a dual-path transformer module, an equalization module, a generation module, and a post-processing module. The dual-path transformer module extracts the local and global contextual relationships of long-term speech sequences, which helps in inferring the missing components. The equalization and generation modules are designed according to the characteristics of FVS speech and further improve speech quality by simulating the inverse of the speech distortion process. The experimental results demonstrate that the proposed model effectively improves the quality of FVS speech: averaged over three male and three female speakers, the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and composite measure for overall speech quality (COVL) scores show relative increases of 64.19%, 29.63%, and 101.37%, respectively, outperforming the baseline models developed in different domains. The proposed model also has significantly lower complexity than those baselines.
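The "local and global" modeling attributed to the dual-path transformer module can be illustrated with a small sketch. In dual-path models generally, a long sequence is cut into overlapping chunks; a local path then processes each chunk, and a global path processes the same position across all chunks. The Python below shows only that data layout, not the authors' implementation: the function names, the chunk length, and the hop size are hypothetical, and no transformer layers are included.

```python
from typing import List


def segment(frames: List[float], chunk_len: int, hop: int) -> List[List[float]]:
    """Split a 1-D sequence into overlapping chunks, zero-padding the last one.

    chunk_len and hop are illustrative parameters, not values from the paper.
    """
    if chunk_len <= 0 or hop <= 0:
        raise ValueError("chunk_len and hop must be positive")
    chunks: List[List[float]] = []
    start = 0
    while True:
        chunk = frames[start:start + chunk_len]
        # Pad the tail so every chunk has the same length.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
        if start + chunk_len >= len(frames):
            break
        start += hop
    return chunks


def intra_chunk_views(chunks: List[List[float]]) -> List[List[float]]:
    """Sequences for the local path: one sequence per chunk."""
    return [list(chunk) for chunk in chunks]


def inter_chunk_views(chunks: List[List[float]]) -> List[List[float]]:
    """Sequences for the global path: one sequence per in-chunk position,
    running across all chunks."""
    return [list(column) for column in zip(*chunks)]


if __name__ == "__main__":
    signal = [float(i) for i in range(10)]
    chunks = segment(signal, chunk_len=4, hop=2)
    print(intra_chunk_views(chunks))
    print(inter_chunk_views(chunks))
```

With a hop of half the chunk length, each local sequence covers a short span of neighbouring samples, while each global sequence strides across the whole signal, which is how a dual-path layout keeps both paths short even for long inputs.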
Funder
National Natural Science Foundation of China; Key Research and Development Program of Zhejiang Province
Publisher
Acoustical Society of America (ASA)
Subject
Acoustics and Ultrasonics, Arts and Humanities (miscellaneous)
Cited by
6 articles.