Affiliation:
1. Computer Science Department, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City 3000, Mexico
Abstract
Deep learning-based speech-enhancement techniques have recently attracted growing interest, since their impressive performance can benefit a wide variety of digital voice-communication systems. However, this performance has been evaluated mostly in offline audio-processing scenarios (i.e., feeding the model a complete audio recording in one go, which may extend over several seconds). It is therefore of significant interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of audio segments and concatenating the results at the output end). Although evaluations and comparisons between speech-enhancement techniques have been carried out before, to the best of the author's knowledge, the work presented here is the first to evaluate the performance of such techniques in relation to their online applicability. Specifically, this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and the memory usage (as online metrics) are affected by the input length (the size of the audio segments), as well as by the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models were evaluated, chosen for their availability in public repositories and their online viability: MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser. The characterization was carried out using a systematic evaluation protocol based on the SpeechBrain framework. Several insights are presented and discussed, and some recommendations for future work are proposed.
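The online scenario described above (feeding the model a sequence of audio segments and concatenating the results, while timing each call) can be sketched as follows. This is a minimal illustration, not the paper's evaluation protocol: `enhance` is a hypothetical stand-in for a real model's forward pass (e.g., MetricGAN+ or Demucs-Denoiser), and the segment length is an arbitrary illustrative value.

```python
import time

def enhance(segment):
    """Hypothetical placeholder for a speech-enhancement model's
    forward pass; here it simply returns the segment unchanged."""
    return list(segment)

def online_enhance(signal, segment_len):
    """Feed the model fixed-size segments, concatenate the outputs,
    and record the per-segment response time."""
    output = []
    latencies = []
    for start in range(0, len(signal), segment_len):
        segment = signal[start:start + segment_len]
        t0 = time.perf_counter()
        output.extend(enhance(segment))
        latencies.append(time.perf_counter() - t0)
    return output, latencies

# Toy input: 10 samples, processed in segments of 4 (the last is shorter).
signal = [0.1 * i for i in range(10)]
out, lat = online_enhance(signal, 4)
assert out == signal       # identity "model": concatenation recovers the input
assert len(lat) == 3       # one latency measurement per segment
```

Varying `segment_len` in such a loop is how the input-length dimension of the characterization can be exercised: shorter segments lower the per-call latency but increase the number of model invocations.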
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by: 7 articles.