Multimodal dynamic fusion framework: Multilevel feature fusion guided by prompts

Authors:

Pan Lei¹, Wu Huan‐Qing¹

Affiliation:

1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou, China

Abstract

With the progressive growth in the parameter counts of multimodal models, some studies have sought to improve computational efficiency by fine-tuning unimodal pre-trained models to perform multimodal fusion tasks. However, these methods tend to rely on simplistic or single fusion strategies, neglecting more flexible fusion approaches. Moreover, existing methods prioritize the integration of modality features carrying high-level semantic information, often overlooking the influence that fusing low-level features has on the outcome. This study therefore introduces multilevel feature fusion guided by prompts (MFF-GP), a multimodal dynamic fusion framework. Prompt vectors guide a dynamic neural network to select a suitable fusion network for each hierarchical feature of the unimodal pre-trained models, improving the interactions between modalities and promoting more efficient fusion of their features. Extensive experiments on the UPMC Food 101, SNLI-VE and MM-IMDB datasets demonstrate that, with only a few trainable parameters, MFF-GP achieves significant accuracy improvements over the recently proposed fine-tuning-based PMF: 2.15% on UPMC Food 101 and 0.82% on SNLI-VE. Further analysis reveals that increasing the diversity of interactions between distinct modalities is critical and delivers significant performance gains. Furthermore, for certain multimodal tasks, attending to low-level features benefits modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.
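
The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of how prompt vectors might route each layer's unimodal features to one of several candidate fusion modules; it is not the authors' implementation (see the linked repository for that), and all module and parameter names (GatedFusion, ConcatFusion, PromptGuidedLayerFusion, prompt_dim, the Gumbel-softmax router) are illustrative assumptions.

# Hypothetical sketch of prompt-guided dynamic fusion; not the MFF-GP code.
# A learnable prompt vector per layer feeds a small router that selects one of
# several candidate fusion modules for that layer's unimodal features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Gated blend of two modality features (one candidate fusion network)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
    def forward(self, x_img, x_txt):
        g = torch.sigmoid(self.gate(torch.cat([x_img, x_txt], dim=-1)))
        return g * x_img + (1 - g) * x_txt

class ConcatFusion(nn.Module):
    """Concatenation followed by projection (another candidate fusion network)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
    def forward(self, x_img, x_txt):
        return self.proj(torch.cat([x_img, x_txt], dim=-1))

class PromptGuidedLayerFusion(nn.Module):
    """For one hierarchical layer: a learnable prompt vector is mapped to routing
    logits, and a (hard) Gumbel-softmax picks among the candidate fusion modules."""
    def __init__(self, dim, prompt_dim=32):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_dim))  # learnable prompt vector
        self.candidates = nn.ModuleList([GatedFusion(dim), ConcatFusion(dim)])
        self.router = nn.Linear(prompt_dim, len(self.candidates))  # dynamic selector
    def forward(self, x_img, x_txt, hard=True):
        logits = self.router(self.prompt)
        weights = F.gumbel_softmax(logits, tau=1.0, hard=hard)  # near one-hot choice
        return sum(w * m(x_img, x_txt) for w, m in zip(weights, self.candidates))

# Usage: one fusion block per layer of the frozen unimodal encoders (12 layers assumed).
layers = nn.ModuleList([PromptGuidedLayerFusion(dim=768) for _ in range(12)])
img_feats = [torch.randn(4, 768) for _ in range(12)]  # per-layer image features
txt_feats = [torch.randn(4, 768) for _ in range(12)]  # per-layer text features
fused_per_layer = [layers[i](img_feats[i], txt_feats[i]) for i in range(12)]
print(fused_per_layer[-1].shape)  # torch.Size([4, 768])

Only the prompts, routers and candidate fusion modules would be trainable under this reading, which is consistent with the abstract's claim of few trainable parameters on top of frozen unimodal backbones.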

Publisher

Wiley

