Abstract
In most music emotion recognition (MER) tasks, researchers use supervised learning models trained on music features and the corresponding annotations. However, few have applied unsupervised learning to labeled data beyond feature representation. In this paper, we propose a segment-based two-stage model that combines unsupervised and supervised learning. In the first stage, we split each music excerpt into contiguous segments and use an autoencoder to generate segment-level feature representations. In the second stage, we feed these time-series music segments into a bidirectional long short-term memory (BiLSTM) deep learning model to perform the final music emotion classification. Compared with whole music excerpts, segments provide a more suitable input granularity for model training and enlarge the training set, reducing the risk of overfitting in deep learning. In addition, we apply frequency and time masking to the segment-level inputs in the unsupervised stage to improve training performance. We evaluate our model on two datasets. The results show that our model outperforms state-of-the-art models, some of which use multimodal architectures, and the performance comparison also demonstrates the effectiveness of audio segmentation and of the autoencoder with masking trained in an unsupervised way.
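To make the two-stage architecture described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: the segment feature dimensions, layer sizes, and class count are assumptions, and the autoencoder and BiLSTM modules are generic stand-ins for the components the paper describes.

```python
# Illustrative sketch of the segment-based two-stage idea.
# All dimensions and layer sizes below are assumptions for demonstration only.
import torch
import torch.nn as nn

class SegmentAutoencoder(nn.Module):
    """Stage 1: unsupervised learning of segment-level feature representations."""
    def __init__(self, in_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):                  # x: (batch, in_dim) per segment
        z = self.encoder(x)                # segment-level latent representation
        return self.decoder(z), z          # reconstruction for the unsupervised loss

class BiLSTMClassifier(nn.Module):
    """Stage 2: supervised emotion classification over the segment sequence."""
    def __init__(self, latent_dim=32, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, z_seq):              # z_seq: (batch, n_segments, latent_dim)
        out, _ = self.lstm(z_seq)
        return self.head(out[:, -1])       # excerpt-level emotion prediction
```

The frequency and time masking mentioned in the abstract could, for example, be applied to the spectrogram segments before the autoencoder using SpecAugment-style transforms such as torchaudio.transforms.FrequencyMasking and torchaudio.transforms.TimeMasking; the specific masking parameters used by the authors are not given in this abstract.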
Funder
University of Technology Sydney
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences, Media Technology, Information Systems
Cited by
12 articles.