Affiliation:
1. Nanyang Technological University, Singapore, Singapore
Abstract
Sequential recommendation systems often suffer from data sparsity, leading to suboptimal performance. While multimodal content, such as images and text, has been used to mitigate this issue, integrating it into sequential recommendation frameworks remains challenging. Current multimodal sequential recommendation models often fail to effectively capture correlations among the behavior sequences of users and items across different modalities: they either neglect correlations among sequence representations or inadequately capture associations between multimodal data and sequence data. To address this problem, we explore multimodal pre-training in the context of sequential recommendation, with the aim of enhancing the fusion and utilization of multimodal information.
We propose a novel Multimodal Pre-training for Sequential Recommendation (MP4SR) framework, which utilizes contrastive losses to capture the correlation among different modality sequences of users, as well as the correlation among different modality sequences of users and items. MP4SR consists of three key components: 1) multimodal feature extraction, 2) a backbone network, the Multimodal Mixup Sequence Encoder (M²SE), and 3) pre-training tasks. After utilizing pre-trained encoders to generate initial multimodal features of items, M²SE adopts a complementary sequence mixup strategy to fuse different modality sequences, and leverages contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels. Extensive experiments on four real-world datasets demonstrate that MP4SR outperforms state-of-the-art approaches in both normal and cold-start settings. We further highlight the efficacy of incorporating multimodal pre-training into sequential recommendation representation learning, where it serves as an effective regularizer and optimizes the parameter space for the recommendation task.
Publisher
Association for Computing Machinery (ACM)
Cited by: 1 article.