Authors:
Jiang Huawei, Mutahira Husna, Park Unsang, Muhammad Mannan Saeed
Abstract
A number of remarkable accomplishments have been achieved in the field of audio classification using algorithms based on Transformers in recent years. As addressed in the literature, sound classification commonly involves the analysis of audio recordings that are usually five seconds or longer in duration. This raises a secondary question: Can Transformers effectively classify extremely short audio samples? The main objective of this study is to utilize the Transformer model for sound classification, focusing on extremely brief audio clips, with an average sound duration of $1.24\times 10^{-2}$ seconds, which is too short for human recognition. In addition, a new filter is developed to obtain an instantaneous audio dataset. This filter is applied individually to the ESC-50, UrbanSound8K, AESDD, ReaLISED and RAVDESS datasets to obtain corresponding instantaneous datasets. Moreover, a new data augmentation technique is introduced with the objective of increasing classification accuracy. A comparative analysis between the proposed scheme and the mainstream data augmentation methods is performed on the instantaneous audio datasets, resulting in accuracy rates of 94.16%, 96.40%, 70.98%, 89.28%, and 53.51%, respectively. This study has the main advantage of being able to classify sounds efficiently for extremely short audio duration.
Publisher
Springer Science and Business Media LLC