Affiliation:
1. Shanghai Jiao Tong University, China
Abstract
Speech enhancement on mobile devices is a challenging task due to complex environmental noise. Recent works using lip-induced ultrasound signals for speech enhancement open up new possibilities for solving this problem. However, these multi-modal methods cannot be used in many scenarios where ultrasound-based lip sensing is unreliable or completely absent. In this paper, we propose a novel paradigm that exploits prior learned ultrasound knowledge for multi-modal speech enhancement using only the audio input and an additional pre-enrollment speaker embedding. We design a memory network to store the ultrasound memory and learn the interrelationship between the audio and ultrasound modalities. During inference, the memory network is able to recall the ultrasound representations from the audio input, achieving multi-modal speech enhancement without real ultrasound signals. Moreover, we introduce a speaker embedding module to further boost enhancement performance and to avoid degradation of the recall when the noise level is high. We train the proposed framework in an end-to-end multi-task manner and perform extensive evaluations on the collected dataset. The results show that our method yields performance comparable to audio-ultrasound methods and significantly outperforms audio-only methods.
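The cross-modal recall described in the abstract can be illustrated with a minimal attention-based key-value memory sketch. This is an assumption-laden illustration, not the paper's implementation: the slot count, feature dimension, and the function names (`recall_ultrasound`, `softmax`) are hypothetical, and the real keys and values would be learned jointly with the enhancement network rather than fixed arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recall_ultrasound(audio_query, keys, values):
    """Recall stored ultrasound representations from audio features.

    audio_query: (T, d) audio frame features (the query)
    keys:        (M, d) learned audio-aligned memory keys
    values:      (M, d) stored ultrasound memory slots
    Returns:     (T, d) recalled ultrasound-like features, one per frame.
    """
    # Scaled dot-product attention over the memory slots.
    scores = audio_query @ keys.T / np.sqrt(keys.shape[1])  # (T, M)
    attn = softmax(scores, axis=-1)                         # rows sum to 1
    # Each recalled frame is a convex combination of the value slots.
    return attn @ values
```

Because each output frame is a convex combination of the value slots, the recalled features always lie inside the span of the stored ultrasound memory, which is what lets inference proceed without a live ultrasound signal.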
Funder
National Natural Science Fund of China
Publisher
Association for Computing Machinery (ACM)
References (67 articles)
1. T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In INTERSPEECH.
2. Beijing DataTang Technology Co., Ltd. [n.d.]. aidatatang_200zh: a free Chinese Mandarin speech corpus. https://www.datatang.com
3. Suppression of acoustic noise in speech using spectral subtraction
4. Lite Audio-Visual Speech Enhancement
5. J. S. Chung, A. Nagrani, and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.
Cited by
1 article.
1. Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing;Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies;2024-05-13