Abstract
In recent years, with the growth of the internet, video has become pervasive in everyday life, and pairing a video with well-matched music is increasingly treated as a creative task in its own right. Because selecting music manually costs considerable time and effort, we propose a method that recommends background music for videos. The emotional message of music is rarely considered in existing work, yet it is crucial for video-music retrieval. To exploit it, we design two paths that process content information and emotional information across modalities. Based on the characteristics of video and music, we design dedicated feature extraction schemes and common representation spaces. In the content path, pre-trained networks serve as feature extractors; because these features contain redundant information, we use an encoder-decoder structure for dimensionality reduction, sharing the encoder weights so that video and music are mapped to common content features. In the emotion path, we apply an emotion key-frame scheme to video and a channel attention mechanism to music to capture emotional information effectively, and we add an emotion discrimination loss to ensure that the network actually learns it. More importantly, we propose a way to combine content information with emotional information: content features are concatenated with emotion features and then passed through a fused shared space, structured as an MLP, to obtain more effective fused shared features. In addition, we add a polarity penalty factor to the classical metric loss function to make it better suited to this task. Experiments show that this dual-path video-music retrieval network fuses the two kinds of information effectively; compared with existing methods, it improves Recall@1 by 3.94.
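To make the fusion step and the modified metric loss concrete, below is a minimal PyTorch sketch. Everything in it is an assumption made for illustration: the feature dimensions, the MLP layout of the fused shared space, and the particular polarity penalty (a margin scaled up when the anchor and negative emotion polarities disagree) are not specified in the abstract.

```python
# Minimal sketch (an assumption-laden illustration, not the authors'
# implementation): fusing content and emotion features into a shared
# space, plus a polarity-penalized triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSharedSpace(nn.Module):
    """Concatenates a modality's content and emotion features and
    projects them into the common retrieval space with an MLP."""
    def __init__(self, content_dim=512, emotion_dim=128, shared_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + emotion_dim, 512),
            nn.ReLU(),
            nn.Linear(512, shared_dim),
        )

    def forward(self, content_feat, emotion_feat):
        fused = torch.cat([content_feat, emotion_feat], dim=-1)
        return F.normalize(self.mlp(fused), dim=-1)  # unit-norm embeddings

def polarity_triplet_loss(anchor, positive, negative,
                          anchor_pol, neg_pol, margin=0.2, gamma=0.5):
    """Triplet margin loss with a hypothetical polarity penalty: the
    margin is enlarged when the negative's emotion polarity (labels in
    {-1, +1}) disagrees with the anchor's, pushing such pairs apart."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    penalty = 1.0 + gamma * (anchor_pol != neg_pol).float()
    return F.relu(d_pos - d_neg + margin * penalty).mean()

# Toy usage: one projector shared by both modalities (an assumption),
# random features standing in for the content/emotion paths' outputs.
proj = FusedSharedSpace()
v = proj(torch.randn(4, 512), torch.randn(4, 128))      # video anchors
m_pos = proj(torch.randn(4, 512), torch.randn(4, 128))  # matched music
m_neg = proj(torch.randn(4, 512), torch.randn(4, 128))  # mismatched music
loss = polarity_triplet_loss(v, m_pos, m_neg,
                             anchor_pol=torch.tensor([1, 1, -1, -1]),
                             neg_pol=torch.tensor([-1, 1, 1, -1]))
print(loss.item())
```

Scaling the margin by polarity is only one plausible reading of a "polarity penalty factor"; an additive penalty on the negative distance would serve the same purpose.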
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry