Affiliation:
1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou, China
Abstract
With the progressive growth of parameters in multimodal models, some studies have sought to improve computational efficiency by fine-tuning unimodal pre-trained models for multimodal fusion tasks. However, these methods tend to rely on simplistic or singular fusion strategies, neglecting more flexible fusion approaches. Moreover, existing methods prioritize the integration of modality features carrying high-level semantic information, often overlooking the influence that fusing low-level features has on the outcome. This study therefore introduces multilevel feature fusion guided by prompts (MFF-GP), a multimodal dynamic fusion framework. Prompt vectors guide a dynamic neural network to select a suitable fusion network for each hierarchical feature of the unimodal pre-trained models. This design enriches the interactions between modalities and promotes more efficient fusion of features across them. Extensive experiments on the UPMC Food 101, SNLI-VE and MM-IMDB datasets demonstrate that, with only a few trainable parameters, MFF-GP achieves significant accuracy improvements over PMF, a recent fine-tuning-based method: 2.15% on UPMC Food 101 and 0.82% on SNLI-VE. Further analysis reveals that increasing the diversity of interactions between distinct modalities is critical and delivers significant performance improvements. Moreover, for certain multimodal tasks, attending to low-level features benefits modality integration. Our implementation is available at: https://github.com/whq2024/MFF-GP.
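The core idea of the abstract — a prompt vector scoring several candidate fusion networks and mixing their outputs at each layer — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scoring matrix `W`, the three toy fusion functions, and the soft (score-weighted) selection are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def prompt_guided_fusion(text_feat, image_feat, prompt, fusion_fns, W):
    """Hypothetical sketch: a prompt vector is mapped to one score per
    candidate fusion function; this layer's features are fused by a
    score-weighted mix of the candidates (a soft dynamic selection)."""
    scores = softmax(W @ prompt)                       # one score per candidate
    fused = [f(text_feat, image_feat) for f in fusion_fns]
    return sum(s * f for s, f in zip(scores, fused))   # weighted combination

d = 8  # toy feature dimension
fusion_fns = [
    lambda t, v: t + v,           # additive fusion
    lambda t, v: t * v,           # multiplicative (gated) fusion
    lambda t, v: 0.5 * (t + v),   # averaging fusion
]
W = rng.normal(size=(len(fusion_fns), d))  # scoring weights (learned in practice)
t, v, p = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
out = prompt_guided_fusion(t, v, p, fusion_fns, W)
print(out.shape)  # → (8,)
```

In a full model each transformer layer of the unimodal backbones would have its own prompt and candidate set, so different layers (including low-level ones) can receive different fusion strategies.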