EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement
Published: 2024-08-22
Issue: 3
Volume: 8
Pages: 1-30
ISSN: 2474-9567
Container-title: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Language: en
Short-container-title: Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.
Author:
Han Feiyu¹, Yang Panlong², Zuo You³, Shang Fei³, Xu Fenglei⁴, Li Xiang-Yang³
Affiliation:
1. Nanjing University of Information Science and Technology, Nanjing, China, and University of Science and Technology of China, Hefei, China
2. Nanjing University of Information Science and Technology, Nanjing, China
3. University of Science and Technology of China, Hefei, China
4. Suzhou University of Science and Technology, Jiangsu Industrial Intelligent and Low-carbon Technology Engineering Center, Suzhou, China
Abstract
Earphones have become a popular voice input and interaction device. However, airborne speech is susceptible to ambient noise, making it necessary to improve the quality and intelligibility of speech captured by earphones in noisy conditions. As the dual-microphone structure (i.e., outer and in-ear microphones) has been widely adopted in earphones (especially ANC earphones), we design EarSpeech, which exploits in-ear acoustic sensing as a complementary modality to enable airborne speech enhancement. The key idea of EarSpeech is that in-ear speech is less sensitive to ambient noise and is correlated with airborne speech. However, due to the occlusion effect, in-ear speech has limited bandwidth, making it challenging to correlate directly with full-band airborne speech. Therefore, we exploit the occlusion effect to carry out theoretical modeling and quantitative analysis of this cross-channel correlation and study how to leverage it for speech enhancement. Specifically, we design a series of methodologies, including data augmentation, deep learning-based fusion, and a noise mixture scheme, to improve the generalization, effectiveness, and robustness of EarSpeech, respectively. Lastly, we conduct real-world experiments to evaluate the performance of our system. EarSpeech achieves average improvement ratios of 27.23% and 13.92% in terms of PESQ and STOI, respectively, and improves SI-SDR by 8.91 dB. Benefiting from data augmentation, EarSpeech achieves comparable performance with a small-scale dataset that is 40 times smaller than the original dataset. In addition, we validate generalization across different users, speech content, and language types, as well as robustness in the real world, via comprehensive experiments. An audio demo of EarSpeech is available at https://github.com/EarSpeech/earspeech.github.io/.
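For readers unfamiliar with the reported metric, the following minimal Python sketch shows how SI-SDR is typically computed; it assumes NumPy and equal-length time-domain waveforms, and the function and variable names are illustrative rather than the authors' actual evaluation code. The reported 8.91 dB gain can then be read as the difference between the enhanced and noisy scores against the clean reference. PESQ and STOI are perceptual metrics normally computed with dedicated toolkits rather than a few lines of code.

import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    estimate, reference: 1-D arrays of equal length (time-domain waveforms).
    This is the standard SI-SDR definition, not the paper's exact pipeline.
    """
    # Remove DC offset so the metric ignores constant shifts.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to obtain the target component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Example usage: an 8.91 dB improvement corresponds to
# si_sdr(enhanced, clean) - si_sdr(noisy, clean) ≈ 8.91.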
Funder
National Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)