A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization

Authors:

Chen Guorong 1, Yu Yuan 1, Qiao Yuan 1, Yang Junliang 1, Du Chongling 1, Qian Zhang 1, Huang Xiao 2

Affiliations:

1. School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China

2. Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR 999077, China

Abstract

Sound Event Detection and Localization (SELD) is a comprehensive task that aims to solve the subtasks of Sound Event Detection (SED) and Sound Source Localization (SSL) simultaneously. The difficulty of SELD lies in having to solve sound recognition and spatial localization at the same time; moreover, sound events of different categories may overlap in time and space, which makes it harder for a model to distinguish concurrent events and to locate their sources. In this study, the Dual-conv Coordinate Attention Module (DCAM) combines dual convolutional blocks with Coordinate Attention, and on this basis the two-stage network architecture is improved to form the Two-Stage Dual-conv Coordinate Attention Model (TDCAM) for SELD. TDCAM draws on the concepts of Visual Geometry Group (VGG) networks and Coordinate Attention to effectively capture critical local information, attending to the coordinate information of the feature map and modeling the relationships between feature-map channels to enhance the feature selection capability of the model. To address the limitation of the single-layer Bi-directional Gated Recurrent Unit (Bi-GRU) in the two-stage network for temporal processing, we extend the structure to a two-layer Bi-GRU and introduce frequency-mask and time-mask data augmentation to improve the model's temporal modeling and generalization ability. Experiments on the TAU Spatial Sound Events 2019 development dataset show that our approach significantly improves SELD performance compared to the two-stage network baseline model. Furthermore, ablation experiments confirm the effectiveness of DCAM and the two-layer Bi-GRU structure.
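The abstract names three concrete components: a DCAM block built from two convolutional layers plus Coordinate Attention, a two-layer Bi-GRU temporal back-end, and frequency/time masking for augmentation. Coordinate Attention factorizes global pooling into two one-dimensional pools along the two spatial axes, so the resulting attention weights retain positional information along both time and frequency. The PyTorch sketch below is not the authors' implementation; it only illustrates, under assumed channel counts, reduction ratio, pooling shape, and layer ordering, what such a block could look like (the class names CoordinateAttention and DCAMBlock are hypothetical).

# A minimal sketch, not the authors' released code: a hypothetical DCAM block
# built from two VGG-style 3x3 convolutions followed by standard Coordinate
# Attention. Channel counts, the reduction ratio, the pooling shape, and the
# layer ordering are assumptions made for illustration only.
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Pool along the two spatial axes separately, then re-weight the
    feature map with the resulting directional attention maps."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # (B, C, 1, W)
        return x * a_h * a_w


class DCAMBlock(nn.Module):
    """Hypothetical dual-conv + coordinate-attention block."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.ca = CoordinateAttention(out_ch)
        self.pool = nn.MaxPool2d((1, 2))   # assumed: pool along frequency only

    def forward(self, x):                  # x: (batch, channels, time, mel bins)
        return self.pool(self.ca(self.convs(x)))


if __name__ == "__main__":
    # The temporal back-end described in the abstract would be a two-layer
    # Bi-GRU, e.g. nn.GRU(feat_dim, 128, num_layers=2, bidirectional=True,
    # batch_first=True); frequency/time masks can be applied to the input
    # spectrogram, e.g. with torchaudio.transforms.FrequencyMasking/TimeMasking.
    x = torch.randn(2, 1, 128, 64)         # (batch, 1, time frames, mel bins)
    print(DCAMBlock(1, 64)(x).shape)       # torch.Size([2, 64, 128, 32])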

Funder

Chongqing Technology Innovation and Application Development Project

Chongqing Postgraduate Research and Innovation Program 2023

Publisher

MDPI AG
