VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning

Published:2024-01-30 Issue:3 Volume:14 Page:1169
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Ma Han¹, Fan Baoyu¹, Ng Benjamin K.¹, Lam Chan-Tong¹

Affiliation:

1. Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China

Abstract

Complex real-world tasks, such as visual question answering (VQA), involve models of different modalities. However, traditional multimodal learning requires large amounts of aligned data, such as image-text pairs, and constructing training data at that scale is difficult. We therefore propose VL-Few, a simple and effective method for the multimodal few-shot problem. VL-Few (1) introduces modal alignment, which maps visual features into the language space through a lightweight network and improves the model's multimodal understanding; (2) applies few-shot meta learning to the multimodal problem by constructing a few-shot meta task pool, which improves the model's generalization; (3) introduces semantic alignment to strengthen the model's semantic understanding of the task, context, and demonstrations; (4) introduces task alignment, which casts training data into the form of the target task and improves the model's task understanding; (5) introduces generation alignment, which uses token-level training and a multitask fusion loss to improve the model's generation ability. Experimental results show the effectiveness of VL-Few on multimodal few-shot problems.
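
The abstract does not say how the few-shot meta task pool in point (2) is built. The sketch below shows one common way to sample such a pool as N-way K-shot episodes, and it should not be read as the authors' procedure. The function name `build_task_pool`, its parameters, the dictionary layout, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_task_pool(samples, n_way, k_shot, n_query, n_tasks, seed=0):
    """Sample few-shot episodes (meta tasks) from labeled samples.

    Hypothetical helper, not taken from the paper.
    samples: list of (item, label) pairs.
    Each task has a support set (k_shot items per class) and a
    query set (n_query items per class) over n_way sampled classes.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    # Only classes with enough items can fill both support and query sets.
    eligible = sorted(
        label for label, items in by_label.items()
        if len(items) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient samples")

    pool = []
    for _ in range(n_tasks):
        classes = rng.sample(eligible, n_way)
        support, query = [], []
        for label in classes:
            # Draw without replacement so support and query never overlap.
            picked = rng.sample(by_label[label], k_shot + n_query)
            support += [(x, label) for x in picked[:k_shot]]
            query += [(x, label) for x in picked[k_shot:]]
        pool.append({"support": support, "query": query})
    return pool


# Toy data (assumed): 4 classes with 5 placeholder image ids each.
toy = [(f"img_{c}_{i}", c) for c in "abcd" for i in range(5)]
tasks = build_task_pool(toy, n_way=2, k_shot=1, n_query=2, n_tasks=3)
```

Each sampled task here is a small classification-style episode. In a VQA setting the items would more plausibly be image-question-answer triples, but the sampling logic would stay the same under these assumptions.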

Funder

Macao Polytechnic University

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/3/1169/pdf
