Abstract
This paper presents a new approach to multiclass audio segmentation, the task of classifying an audio signal as speech, music, noise, or a combination of these, based on recurrent neural networks (RNNs). The proposed system uses bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module that gains long-term stability by means of the tied-state concept from hidden Markov models. We explore different neural architectures, introducing temporal pooling layers to reduce the sampling rate of the neural network output. Our findings show that removing redundant temporal information benefits the segmentation system, yielding a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 on CPU and below 0.03 on GPU. This new architecture, combined with a data-agnostic data augmentation technique called mixup, allows our system to achieve competitive results on both the Albayzín 2010 and 2012 evaluation datasets, with relative improvements of 19.72% and 5.35% over the best results found in the literature for these databases.
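To make the two core ideas concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a BLSTM segmenter with a temporal max-pooling layer that halves the output frame rate, together with the data-agnostic mixup augmentation mentioned above. The feature dimension, hidden size, pooling factor, and four-class label set are illustrative assumptions.

import torch
import torch.nn as nn

NUM_CLASSES = 4  # e.g. speech, music, noise, speech+music (assumed label set)

class BLSTMSegmenter(nn.Module):
    def __init__(self, n_features=64, hidden=128, pool=2):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Temporal pooling: halves the output sampling rate without
        # adding any parameters, mirroring the claim in the abstract.
        self.pool = nn.MaxPool1d(kernel_size=pool, stride=pool)
        self.out = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x):                              # x: (batch, frames, features)
        h, _ = self.blstm(x)                           # (batch, frames, 2*hidden)
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)  # pool over the time axis
        return self.out(h)                             # per-frame class logits

def mixup(x, y, alpha=0.2):
    # Data-agnostic mixup: convex combination of two examples and their
    # (one-hot) labels, with the mixing weight drawn from a Beta distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

# Usage sketch: 8 clips, 200 frames of 64-dim features each.
model = BLSTMSegmenter()
feats = torch.randn(8, 200, 64)
labels = torch.eye(NUM_CLASSES)[torch.randint(NUM_CLASSES, (8, 200))]  # one-hot per frame
xm, ym = mixup(feats, labels)
logits = model(xm)  # (8, 100, 4): frame rate halved by the pooling layer;
                    # for training, labels would be downsampled to match.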
Publisher
Springer Science and Business Media LLC
Subject
Electrical and Electronic Engineering, Acoustics and Ultrasonics
Cited by
33 articles.
1. Speech Feature Extraction in Broadcast Hosting Based on Fluctuating Equation Inversion;Journal of Advanced Computational Intelligence and Intelligent Informatics;2024-07-20
2. Brain Inspired Access Synthetic Consciousness Using A Neural Network Cluster;2024 4th Interdisciplinary Conference on Electrics and Computer (INTCEC);2024-06-11
3. Light Gated Multi Mini-Patch Extractor for Audio Classification;2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW);2024-04-14
4. An Explainable Proxy Model for Multilabel Audio Segmentation;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14
5. Automated Classification of Animal Vocalization into Estrus and Non-Estrus Condition using AI Techniques;2023 OITS International Conference on Information Technology (OCIT);2023-12-13