Abstract
AbstractTechnologies in healthcare, smart homes, security, ecology, and entertainment all deploy audio event detection (AED) in order to detect sound events in an audio recording. Effective AED techniques rely heavily on supervised or semi-supervised models to capture the wide range of dynamics spanned by sound events in order to achieve temporally precise boundaries and accurate event classification. These methods require extensive collections of labeled or weakly labeled in-domain data, which is costly and labor-intensive. Importantly, these approaches do not fully leverage the inherent variability and range of dynamics across sound events, aspects that can be effectively identified through unsupervised methods. The present work proposes an approach based on multi-rate autoencoders that are pretrained in an unsupervised way to leverage unlabeled audio data and ultimately learn the rich temporal dynamics inherent in natural sound events. This approach utilizes parallel autoencoders that achieve decompositions of the modulation spectrum along different bands. In addition, we introduce a rate-selective temporal contrastive loss to align the training objective with event detection metrics. Optimizing the configuration of multi-rate encoders and the temporal contrastive loss leads to notable improvements in domestic sound event detection in the context of the DCASE challenge.
Funder
Office of Naval Research Global
Publisher
Springer Science and Business Media LLC
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献