Affiliation:
1. Shanghai Jiao Tong University, China
Abstract
Speech enhancement on mobile devices is a challenging task due to complex environmental noise. Recent works using lip-induced ultrasound signals for speech enhancement open up new possibilities for solving this problem. However, these multi-modal methods cannot be used in many scenarios where ultrasound-based lip sensing is unreliable or completely absent. In this paper, we propose a novel paradigm that exploits prior learned ultrasound knowledge for multi-modal speech enhancement using only the audio input and an additional pre-enrollment speaker embedding. We design a memory network to store the ultrasound memory and learn the interrelationship between the audio and ultrasound modalities. During inference, the memory network is able to recall the ultrasound representations from the audio input, achieving multi-modal speech enhancement without real ultrasound signals. Moreover, we introduce a speaker embedding module to further boost enhancement performance and to avoid degradation of the recall when the noise level is high. We train the proposed framework in an end-to-end multi-task manner and perform extensive evaluations on the collected dataset. The results show that our method yields performance comparable to audio-ultrasound methods and significantly outperforms audio-only methods.
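The cross-modal recall described in the abstract can be illustrated with a minimal attention-based key-value memory sketch. This is an assumption-laden illustration, not the paper's implementation: the slot count, feature dimension, and the function names (`recall_ultrasound`, `softmax`) are hypothetical, and the real keys and values would be learned jointly with the enhancement network rather than fixed arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recall_ultrasound(audio_query, keys, values):
    """Recall stored ultrasound representations from audio features.

    audio_query: (T, d) audio frame features (the query)
    keys:        (M, d) learned audio-aligned memory keys
    values:      (M, d) stored ultrasound memory slots
    Returns:     (T, d) recalled ultrasound-like features, one per frame.
    """
    # Scaled dot-product attention over the memory slots.
    scores = audio_query @ keys.T / np.sqrt(keys.shape[1])  # (T, M)
    attn = softmax(scores, axis=-1)                         # rows sum to 1
    # Each recalled frame is a convex combination of the value slots.
    return attn @ values
```

Because each output frame is a convex combination of the value slots, the recalled features always lie inside the span of the stored ultrasound memory, which is what lets inference proceed without a live ultrasound signal.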
Funder
National Natural Science Fund of China
Publisher
Association for Computing Machinery (ACM)
References (67 articles)
1. T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In INTERSPEECH.
2. Beijing DataTang Technology Co., Ltd. [n.d.]. aidatatang_200zh: a free Chinese Mandarin speech corpus. https://www.datatang.com
3. Suppression of acoustic noise in speech using spectral subtraction
4. Lite Audio-Visual Speech Enhancement
5. J. S. Chung, A. Nagrani, and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.
Cited by
1 article.
1. Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing;Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies;2024-05-13