Affiliation:
1. Mohamed bin Zayed University of Artificial Intelligence, UAE
2. Sun Yat-sen University, China
3. University of Glasgow, UK
Abstract
Multimodality Representation Learning, the technique of learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed exceptionally well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. This study summarizes (i) recent task-specific deep learning methodologies, (ii) the pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches through to unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
3 articles.
1. ACTOR: Adapting CLIP for Fully Transformer-based Open-vocabulary Detection;Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security;2024-05-10
2. Personalized time-sync comment generation based on a multimodal transformer;Multimedia Systems;2024-03-30
3. Self-regulating Prompts: Foundational Model Adaptation without Forgetting;2023 IEEE/CVF International Conference on Computer Vision (ICCV);2023-10-01