Deep learning-based late fusion of multimodal information for emotion classification of music video

Published:2020-09-17 Issue:2 Volume:80 Page:2887-2905
ISSN:1380-7501
Container-title:Multimedia Tools and Applications
language:en
Short-container-title:Multimed Tools Appl

Author:

Pandeya Yagya Raj,Lee Joonwhoan

Abstract

Affective computing is an emerging area of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions. Widely available online and offline music videos are a rich source for human emotion analysis because they integrate the composer’s internal feelings through song lyrics, musical instrument performance and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so automatic emotion analysis may be necessary. In this research area, however, the lack of a labeled dataset is a major problem. Therefore, we first construct a balanced music video emotion dataset that is diverse in territory, language, culture and musical instruments. We test this dataset on four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tune each pre-trained unimodal CNN and test its performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier with raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is made to find the best model that can be integrated into a multimodal structure. The best unimodal modality is integrated with corresponding music and video network features for the multimodal classifier. The multimodal structure integrates whole music video features and makes the final classification with a SoftMax classifier using a late feature fusion strategy. All possible multimodal structures are also combined into one predictive model to get the overall prediction. All the proposed multimodal structures use cross-validation to overcome the data scarcity problem (overfitting) at the decision level. The evaluation results using various metrics show a boost in the performance of the multimodal architectures compared to each unimodal emotion classifier. The predictive model that integrates all multimodal structures achieves 88.56% accuracy, an f1-score of 0.88, and an area under the curve (AUC) score of 0.987. The results suggest that high-level human emotions are classified well automatically by the proposed CNN-based multimodal networks, even though only a small number of labeled data samples is available for training.
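The late feature fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes, the class count, the random weights, and the helper names (`late_feature_fusion`, `average_predictions`) are all hypothetical. The abstract also does not say how the multimodal structures are combined into one predictive model, so simple probability averaging is assumed here purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_feature_fusion(audio_feat, video_feat, weights, bias):
    """Concatenate per-modality features, then apply one linear layer and softmax."""
    fused = audio_feat + video_feat  # list concatenation of the two feature vectors
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def average_predictions(prob_lists):
    """Combine several models' class probabilities by averaging (an assumption)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[i] for p in prob_lists) / n_models for i in range(n_classes)]


# Placeholder sizes, not taken from the paper.
NUM_CLASSES, AUDIO_DIM, VIDEO_DIM = 6, 4, 3
rng = random.Random(0)

# Stand-ins for features produced by the music and video networks.
audio_feat = [rng.random() for _ in range(AUDIO_DIM)]
video_feat = [rng.random() for _ in range(VIDEO_DIM)]


def random_head():
    """Random classifier weights standing in for a trained fusion head."""
    dim = AUDIO_DIM + VIDEO_DIM
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(NUM_CLASSES)]
    return weights, [0.0] * NUM_CLASSES


# Two hypothetical multimodal structures, each with its own fusion head.
probs_a = late_feature_fusion(audio_feat, video_feat, *random_head())
probs_b = late_feature_fusion(audio_feat, video_feat, *random_head())

combined = average_predictions([probs_a, probs_b])
predicted = max(range(NUM_CLASSES), key=combined.__getitem__)
```

In this sketch each modality is processed separately and the features meet only at the classifier, which is what makes the fusion "late"; a trained system would replace the random features and weights with network outputs and learned parameters.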

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications,Hardware and Architecture,Media Technology,Software

Link

https://link.springer.com/content/pdf/10.1007/s11042-020-08836-3.pdf

Cited by 102 articles.

1. Triple-modality interaction for deepfake detection on zero-shot identity;Information Fusion;2024-09

2. An Audiovisual Correlation Matching Method Based on Fine-Grained Emotion and Feature Fusion;Sensors;2024-08-31

3. A Two-Stage Multi-Modal Multi-Label Emotion Recognition Decision System Based on GCN;International Journal of Decision Support System Technology;2024-08-16

4. A dataset for multimodal music information retrieval of Sotho-Tswana musical videos;Data in Brief;2024-08

5. Improving Nowcasting of Intense Convective Precipitation by Incorporating Dual-Polarization Radar Variables into Generative Adversarial Networks;Sensors;2024-07-28