Affiliation:
1. Carnegie Mellon University, Pittsburgh, United States
Abstract
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this article is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.
Publisher
Association for Computing Machinery (ACM)