Affiliation:
1. Institute of Intelligent and Software Technology, Hangzhou Danzi University, Hangzhou 310018, China
Abstract
Aiming at the shortcomings of single network classification model, this paper applies CNN-LSTM (convolutional neural networks-long short-term memory) combined network in the field of music emotion classification and proposes a multifeature combined network classifier based on CNN-LSTM which combines 2D (two-dimensional) feature input through CNN-LSTM and 1D (single-dimensional) feature input through DNN (deep neural networks) to make up for the deficiencies of original single feature models. The model uses multiple convolution kernels in CNN for 2D feature extraction, BiLSTM (bidirectional LSTM) for serialization processing and is used, respectively, for audio and lyrics single-modal emotion classification output. In the audio feature extraction, music audio is finely divided and the human voice is separated to obtain pure background sound clips; the spectrogram and LLDs (Low Level Descriptors) are extracted therefrom. In the lyrics feature extraction, the chi-squared test vector and word embedding extracted by Word2vec are, respectively, used as the feature representation of the lyrics. Combining the two types of heterogeneous features selected by audio and lyrics through the classification model can improve the classification performance. In order to fuse the emotional information of the two modals of music audio and lyrics, this paper proposes a multimodal ensemble learning method based on stacking, which is different from existing feature-level and decision-level fusion methods, the method avoids information loss caused by direct dimensionality reduction, and the original features are converted into label results for fusion, effectively solving the problem of feature heterogeneity. Experiments on million song dataset show that the audio classification accuracy of the multifeature combined network classifier in this paper reaches 68%, and the lyrics classification accuracy reaches 74%. The average classification accuracy of the multimodal reaches 78%, which is significantly improved compared with the single-modal.
Subject
General Engineering,General Mathematics
Cited by
35 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献