Comparative Analysis of Deep Learning Architectures and Vision Transformers for Musical Key Estimation

Authors:

Manav Garg 1, Pranshav Gajjar 1, Pooja Shah 2, Madhu Shukla 3, Biswaranjan Acharya 3, Vassilis C. Gerogiannis 4, Andreas Kanavos 5

Affiliations:

1. Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, Gujarat, India

2. School of Technology, Pandit Deendayal Energy University, Gandhinagar 382426, Gujarat, India

3. Department of Computer Engineering—AI and BDA, Marwadi University, Rajkot 360003, Gujarat, India

4. Department of Digital Systems, University of Thessaly, 41500 Larissa, Greece

5. Department of Informatics, Ionian University, 49100 Corfu, Greece

Abstract

The musical key serves as a crucial element in a piece, offering vital insights into the tonal center, harmonic structure, and chord progressions while enabling tasks such as transposition and arrangement. Moreover, accurate key estimation has practical applications in music recommendation systems and automatic music transcription, making it relevant across academic and industrial domains. This paper presents a comprehensive comparison between standard deep learning architectures and emerging vision transformers, leveraging their success in various domains. We evaluate their performance on a specific subset of the GTZAN dataset, analyzing six different deep learning models. Our results demonstrate that DenseNet, a conventional deep learning architecture, achieves a remarkable accuracy of 91.64%, outperforming vision transformers. However, we delve deeper into the analysis to shed light on the temporal characteristics of each deep learning model. Notably, the Vision Transformer and Swin Transformer exhibit a slight decrease in overall accuracy (1.82% and 2.29%, respectively), yet they outperform the DenseNet architecture on temporal metrics. The significance of our findings lies in their contribution to the field of musical key estimation, where accurate and efficient algorithms play a pivotal role. By examining the strengths and weaknesses of deep learning architectures and vision transformers, we can gain valuable insights for practical implementations, particularly in music recommendation systems and automatic music transcription. Our research provides a foundation for future advancements and encourages further exploration in this area.
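The key estimation task the abstract describes can be illustrated with a classical baseline: template matching against the Krumhansl perceptual key profiles (reference 4), in which a 12-bin pitch-class (chroma) histogram is correlated with rotated major and minor profiles. This is a minimal sketch of that baseline, not the deep learning pipeline the paper evaluates; the toy `chroma` vector is an assumed input standing in for features extracted from audio.

```python
import numpy as np

# Krumhansl major/minor key profiles (perceptual tone-profile ratings; see ref. 4).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma):
    """Return the key whose rotated profile best correlates with a 12-bin chroma histogram."""
    best, best_r = None, -np.inf
    for tonic in range(12):
        for name, profile in (("major", MAJOR), ("minor", MINOR)):
            # np.roll shifts the profile so its tonic bin lands on `tonic`.
            r = np.corrcoef(np.roll(profile, tonic), chroma)[0, 1]
            if r > best_r:
                best, best_r = f"{NOTES[tonic]} {name}", r
    return best

# Toy chroma vector emphasizing the C-major scale degrees (C, D, E, F, G, A, B).
chroma = np.zeros(12)
chroma[[0, 2, 4, 5, 7, 9, 11]] = [5, 2, 3, 3, 4, 2, 1]
print(estimate_key(chroma))  # → C major
```

In practice the chroma histogram would be computed from a spectrogram of the audio clip; the deep learning models compared in the paper learn the mapping from such time–frequency representations to key labels directly, rather than relying on fixed templates.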

Funder

Princess Nourah bint Abdulrahman University

King Khalid University

Publisher

MDPI AG

Subject

Information Systems

References (60 articles; first 5 listed):

1. Humphrey, E.J., and Bello, J.P. (2012, January 12–15). Rethinking Automatic Chord Recognition with Convolutional Neural Networks. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA.

2. Mauch, M., and Dixon, S. (2010, January 9–13). Approximate Note Transcription for the Improved Identification of Difficult Chords. Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands.

3. Temperley, D. (2004). The Cognition of Basic Musical Structures, MIT Press.

4. Krumhansl. Tracing the Dynamic Changes in Perceived Tonal Organization in a Spatial Representation of Musical Keys. Psychol. Rev., 1982.

5. Faraldo. Key Estimation in Electronic Dance Music. In Advances in Information Retrieval, Proceedings of the 38th European Conference on IR Research (ECIR), Padua, Italy, 20–23 March 2016.
