Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech

Published: 2024-01-10 Issue: 2 Volume: 14 Page: 609
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Song Siyuan¹, Song Yanjue¹, Madhu Nilesh¹

Affiliation:

1. IDLab, Department of Electronics and Information Systems, Ghent University—imec, 9000 Ghent, Belgium

Abstract

The characterising sound required by an Acoustic Scene Classification (ASC) system is contained in the ambient signal. In practice, however, this signal is often distorted, for example by the foreground speech of people in the surroundings. Previously, based on the iVector framework, we proposed different strategies to improve the classification accuracy when foreground speech is present. In this paper, we extend these methods to deep-learning (DL)-based ASC systems to improve their robustness to foreground speech. ResNet models are proposed as the baseline, in combination with multi-condition training at different signal-to-background ratios (SBRs). For further robustness, we first investigate the noise-floor-based Mel-FilterBank Energies (NF-MFBE) as the input feature of the ResNet model. Next, speech presence information obtained from a speech enhancement (SE) system is incorporated within the ASC framework. Because the speech presence information is time-frequency specific, it allows the network to learn to distinguish better between background signal regions and foreground speech. While the proposed modifications improve the performance of ASC systems when foreground speech is dominant, performance is slightly worse in scenarios with low-level or absent foreground speech. Therefore, as a last consideration, ensemble methods are introduced to integrate the classification scores from different models in a weighted manner. The experimental study systematically validates the contribution of each proposed modification and shows that the final system, with the proposed input features and meta-learner, improves the classification accuracy at all tested SBRs. In particular, for SBRs of 20 dB, absolute improvements of up to 9% can be obtained.
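
Two of the ideas mentioned in the abstract, mixing foreground speech into a background recording at a chosen SBR and combining classification scores from several models in a weighted manner, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the SBR is the ratio of speech power to background power in dB, that the speech (rather than the background) is rescaled, and that fusion is a fixed weighted average rather than a trained meta-learner. The function names `mix_at_sbr` and `fuse_scores` are hypothetical.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_sbr(speech, background, sbr_db):
    """Add speech to background so that the speech-to-background
    power ratio of the mixture components equals sbr_db (in dB).

    Assumption: SBR = 10 * log10(P_speech / P_background), and only the
    speech is rescaled; the background is left unchanged.
    """
    if len(speech) != len(background):
        raise ValueError("speech and background must have equal length")
    p_speech, p_background = power(speech), power(background)
    if p_speech == 0 or p_background == 0:
        raise ValueError("both signals must have non-zero power")
    # Gain that moves the speech power to P_background * 10^(SBR/10).
    gain = math.sqrt(p_background / p_speech * 10 ** (sbr_db / 10))
    return [gain * s + b for s, b in zip(speech, background)]


def fuse_scores(model_scores, weights):
    """Weighted average of per-class scores from several models.

    model_scores: one list of class scores per model.
    weights: one non-negative weight per model (hypothetical values;
    a meta-learner would normally estimate them from data).
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(model_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, model_scores)) / total
        for c in range(n_classes)
    ]
```

For multi-condition training in this sketch, `mix_at_sbr` would be called with several `sbr_db` values on the same background clip, so that a model sees each scene under different amounts of foreground speech.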

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/2/609/pdf

References: 30 articles.

1. Acoustic Scene Classification: Classifying environments from the sounds they produce; Barchiesi; IEEE Signal Process. Mag., 2015

2. Eronen, A., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., Lorho, G., and Huopaniemi, J. (2003, January 6–10). Audio-based context awareness-acoustic modeling and perceptual evaluation. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03), Hong Kong, China.

3. Audio-based context recognition; Eronen; IEEE Trans. Audio Speech Lang. Process., 2006

4. Elizalde, B., Lei, H., Friedland, G., and Peters, N. (April, January 31). An i-vector based approach for audio scene detection. Proceedings of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, Online.

5. Front-End Factor Analysis for Speaker Verification; Dehak; IEEE Trans. Audio Speech Lang. Process., 2011