Affiliation:
1. Fudan University
2. Rochester Institute of Technology
3. Harbin Institute of Technology, Shenzhen
4. Meta AI
Abstract
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches have been introduced to learn fixed textual or visual prompts while freezing the pre-trained model on downstream tasks. Despite the effectiveness of prompt tuning, what those learnable prompts learn remains unexplained. In this work, we explore whether prompts in fine-tuning can learn from knowledge-aware prompts in pre-training, by designing two different sets of prompts for the pre-training and fine-tuning phases respectively. Specifically, we present a Video-Language Prompt tuning (VL-Prompt) approach for video captioning, which first efficiently pre-trains a video-language model to extract key information (e.g., actions and objects) with flexibly generated Knowledge-Aware Prompts (KAP). Then, we design Video-Language Prompts (VLP) to transfer knowledge from the knowledge-aware prompts and fine-tune the model to generate full captions. Experimental results show the superior performance of our approach over several state-of-the-art baselines. We further demonstrate that the video-language prompts are well learned from the knowledge-aware prompts.
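As context for the abstract, the following is a minimal sketch of soft prompt tuning with a frozen backbone, the general mechanism behind keeping the pre-trained model fixed while learning a separate set of prompts (e.g., KAP in pre-training or VLP in fine-tuning). The class and parameter names (PromptedCaptioner, num_prompts, embed_dim) are hypothetical and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptedCaptioner(nn.Module):
    """Hypothetical sketch: learnable prompt embeddings prepended to a frozen backbone."""

    def __init__(self, backbone: nn.Module, num_prompts: int = 8, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        # Freeze the pre-trained video-language model; only prompts are trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable soft prompts (the only trainable parameters).
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) video/text token embeddings.
        batch = token_embeds.size(0)
        prompt = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the prompts and run the frozen backbone on the extended sequence.
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))
```

In this sketch, swapping one prompt set for another (as in moving from pre-training to fine-tuning) amounts to replacing `self.prompts` while the backbone stays frozen; how knowledge is transferred between the two prompt sets is specific to the paper's method and is not shown here.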
Publisher
International Joint Conferences on Artificial Intelligence Organization
Cited by
8 articles.