Improving Audio Classification Method by Combining Self-Supervision with Knowledge Distillation

Authors:

Gong Xuchao 1, Duan Hongjie 1, Yang Yaozhong 1, Tan Lizhuang 2,3, Wang Jian 4, Vasilakos Athanasios V. 5

Affiliations:

1. Artificial Intelligence Research Institute, Shengli Petroleum Management Bureau, Dongying 257000, China

2. Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250013, China

3. Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan 250013, China

4. College of Science, China University of Petroleum (East China), Qingdao 266580, China

5. Department of ICT, Center for AI Research (CAIR), University of Agder (UiA), 4879 Grimstad, Norway

Abstract

Current single-modality self-supervised audio classification relies mainly on audio spectrum reconstruction. This self-supervision strategy is relatively narrow and cannot fully mine the key semantic information in the time and frequency domains. To address this, this article proposes a self-supervised method combined with knowledge distillation to further improve performance on audio classification tasks. First, considering the particular structure of the two-dimensional audio spectrogram, self-supervised tasks are constructed both along the time and frequency dimensions individually and along the joint time-frequency dimension; spectrogram details and key discriminative information are learned effectively through information reconstruction, contrastive learning, and related methods. Second, for feature-level self-supervision, two teacher-student learning strategies are constructed: one internal to the model and one based on knowledge distillation. By fitting the teacher model's feature representations, the student further improves the generalization of audio classification. Comparative experiments were conducted on the AudioSet, ESC-50, and VGGSound datasets. The results show that the proposed algorithm improves recognition accuracy by 0.5% to 1.3% over the best existing single-modality audio methods.
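The abstract names two ingredients: masking-based self-supervision along the time and frequency dimensions of the spectrogram, and a teacher-student distillation objective. The paper's exact formulation is not given here, so the following is only a minimal numpy sketch of the two generic techniques (SpecAugment-style time/frequency masking and a Hinton-style softened-KL distillation loss); all function names and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mask_spectrogram(spec, time_width=8, freq_width=4, rng=None):
    """Zero one random time band and one random frequency band of a
    (freq_bins x frames) spectrogram -- the masked regions become
    reconstruction targets for the self-supervised task (assumed setup)."""
    rng = rng or np.random.default_rng(0)
    masked = spec.copy()
    f, t = spec.shape
    t0 = rng.integers(0, t - time_width)
    f0 = rng.integers(0, f - freq_width)
    masked[:, t0:t0 + time_width] = 0.0   # time-dimension mask
    masked[f0:f0 + freq_width, :] = 0.0   # frequency-dimension mask
    return masked

def softmax(x, tau=1.0):
    # Temperature-scaled softmax; subtracting the max is for stability.
    z = np.exp((x - x.max()) / tau)
    return z / z.sum()

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between temperature-softened teacher and student
    predictions -- the standard knowledge-distillation objective the
    student minimizes to fit the teacher's representation."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    return float(tau**2 * np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Illustrative usage on a random "spectrogram" and toy logits.
spec = np.random.default_rng(1).random((64, 128))   # mel bins x frames
masked = mask_spectrogram(spec)
loss = distillation_loss(np.array([1.0, 2.0, 0.5]),
                         np.array([1.2, 2.5, 0.3]))
```

The temperature `tau` softens both distributions so that the teacher's relative ranking of non-target classes (the "dark knowledge") still carries gradient signal; the `tau**2` factor keeps the loss magnitude comparable across temperatures.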

Funder

National Natural Science Foundation of China

Natural Science Foundation of Shandong Province

Integrated Innovation of Science, Education and Industry of Qilu University of Technology

Talent Project of Qilu University of Technology

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

