Comparative Analysis of Deep Learning Architectures and Vision Transformers for Musical Key Estimation

Authors:

Manav Garg 1, Pranshav Gajjar 1, Pooja Shah 2, Madhu Shukla 3, Biswaranjan Acharya 3, Vassilis C. Gerogiannis 4, Andreas Kanavos 5

Affiliations:

1. Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, Gujarat, India

2. School of Technology, Pandit Deendayal Energy University, Gandhinagar 382426, Gujarat, India

3. Department of Computer Engineering—AI and BDA, Marwadi University, Rajkot 360003, Gujarat, India

4. Department of Digital Systems, University of Thessaly, 41500 Larissa, Greece

5. Department of Informatics, Ionian University, 49100 Corfu, Greece

Abstract

The musical key serves as a crucial element in a piece, offering vital insights into the tonal center, harmonic structure, and chord progressions while enabling tasks such as transposition and arrangement. Moreover, accurate key estimation has practical applications in music recommendation systems and automatic music transcription, making it relevant across academic and industrial domains. This paper presents a comprehensive comparison between standard deep learning architectures and emerging vision transformers, leveraging their success in various domains. We evaluate their performance on a specific subset of the GTZAN dataset, analyzing six different deep learning models. Our results demonstrate that DenseNet, a conventional deep learning architecture, achieves a remarkable accuracy of 91.64%, outperforming vision transformers. However, we delve deeper into the analysis to shed light on the temporal characteristics of each deep learning model. Notably, the Vision Transformer and Swin Transformer exhibit a slight decrease in overall accuracy (1.82% and 2.29%, respectively), yet they outperform the DenseNet architecture on temporal metrics. The significance of our findings lies in their contribution to the field of musical key estimation, where accurate and efficient algorithms play a pivotal role. By examining the strengths and weaknesses of deep learning architectures and vision transformers, we can gain valuable insights for practical implementations, particularly in music recommendation systems and automatic music transcription. Our research provides a foundation for future advancements and encourages further exploration in this area.
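The key estimation task the abstract describes can be illustrated with a classical baseline: template matching against the Krumhansl perceptual key profiles (reference 4), in which a 12-bin pitch-class (chroma) histogram is correlated with rotated major and minor profiles. This is a minimal sketch of that baseline, not the deep learning pipeline the paper evaluates; the toy `chroma` vector is an assumed input standing in for features extracted from audio.

```python
import numpy as np

# Krumhansl major/minor key profiles (perceptual tone-profile ratings; see ref. 4).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma):
    """Return the key whose rotated profile best correlates with a 12-bin chroma histogram."""
    best, best_r = None, -np.inf
    for tonic in range(12):
        for name, profile in (("major", MAJOR), ("minor", MINOR)):
            # np.roll shifts the profile so its tonic bin lands on `tonic`.
            r = np.corrcoef(np.roll(profile, tonic), chroma)[0, 1]
            if r > best_r:
                best, best_r = f"{NOTES[tonic]} {name}", r
    return best

# Toy chroma vector emphasizing the C-major scale degrees (C, D, E, F, G, A, B).
chroma = np.zeros(12)
chroma[[0, 2, 4, 5, 7, 9, 11]] = [5, 2, 3, 3, 4, 2, 1]
print(estimate_key(chroma))  # → C major
```

In practice the chroma histogram would be computed from a spectrogram of the audio clip; the deep learning models compared in the paper learn the mapping from such time–frequency representations to key labels directly, rather than relying on fixed templates.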

Funder

Princess Nourah bint Abdulrahman University

King Khalid University

Publisher

MDPI AG

Subject

Information Systems

References (60 articles; first 5 listed):

1. Humphrey, E.J., and Bello, J.P. (2012, January 12–15). Rethinking Automatic Chord Recognition with Convolutional Neural Networks. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA.

2. Mauch, M., and Dixon, S. (2010, January 9–13). Approximate Note Transcription for the Improved Identification of Difficult Chords. Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands.

3. Temperley, D. (2004). The Cognition of Basic Musical Structures, MIT Press.

4. Krumhansl. Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychol. Rev., 1982.

5. Faraldo. Key Estimation in Electronic Dance Music. In Advances in Information Retrieval, Proceedings of the 38th European Conference on IR Research (ECIR), Padua, Italy, 20–23 March 2016.
