Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech

Authors:

Song Siyuan 1, Song Yanjue 1, Madhu Nilesh 1

Affiliation:

1. IDLab, Department of Electronics and Information Systems, Ghent University—imec, 9000 Ghent, Belgium

Abstract

The sound that characterises an acoustic scene, as required by an Acoustic Scene Classification (ASC) system, is contained in the ambient signal. In practice, however, this signal is often distorted, e.g., by the foreground speech of speakers in the surroundings. Previously, based on the i-vector framework, we proposed different strategies to improve the classification accuracy when foreground speech is present. In this paper, we extend these methods to deep-learning (DL)-based ASC systems in order to improve their robustness to foreground speech. ResNet models are proposed as the baseline, in combination with multi-condition training at different signal-to-background ratios (SBRs). For further robustness, we first investigate noise-floor-based Mel-filterbank energies (NF-MFBE) as the input feature of the ResNet model. Next, speech presence information obtained from a speech enhancement (SE) system is incorporated within the ASC framework. As the speech presence information is time-frequency specific, it allows the network to learn to distinguish better between background-signal regions and foreground speech. While the proposed modifications improve the performance of ASC systems when foreground speech is dominant, performance is slightly worse in scenarios with low-level or absent foreground speech. Therefore, as a final step, ensemble methods are introduced to integrate the classification scores of the different models in a weighted manner. The experimental study systematically validates the contribution of each proposed modification and, for the final system, it is shown that with the proposed input features and meta-learner, the classification accuracy is improved at all tested SBRs. At an SBR of 20 dB in particular, absolute improvements of up to 9% are obtained.
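Multi-condition training at the different SBRs mentioned in the abstract amounts to mixing foreground speech into the background scene at a controlled power ratio. A minimal sketch of such a mixing step is given below; the function name and the simple energy-based scaling are illustrative assumptions, not the authors' actual data-generation pipeline:

```python
import numpy as np

def mix_at_sbr(speech, background, sbr_db):
    """Scale the foreground speech and add it to the background scene
    so that the speech-to-background power ratio equals sbr_db (in dB)."""
    p_speech = np.mean(speech ** 2)
    p_background = np.mean(background ** 2)
    # Gain g such that 10*log10(g^2 * p_speech / p_background) == sbr_db
    g = np.sqrt(p_background / p_speech * 10.0 ** (sbr_db / 10.0))
    return background + g * speech
```

Sweeping `sbr_db` over a range of values (e.g., from -10 dB to 20 dB, plus speech-free examples) then yields a multi-condition training set in which the classifier sees the same scenes under varying degrees of foreground-speech distortion.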

Publisher

MDPI AG

