Cross-Modal Representation Learning

Published: 2023 Page: 211-240
Container-title: Representation Learning for Natural Language Processing

Author:

Yao Yuan, Liu Zhiyuan, Lin Yankai, Sun Maosong

Abstract

Cross-modal representation learning is an essential part of representation learning. It aims to learn semantic representations for different modalities, including text, audio, image, and video, as well as the connections between them. In this chapter, we introduce the development of cross-modal representation learning from shallow to deep, and from respective to unified, in terms of model architectures and learning mechanisms for different modalities and tasks. After that, we review how cross-modal capabilities can contribute to complex real-world applications.

Publisher

Springer Nature Singapore

Link

https://link.springer.com/content/pdf/10.1007/978-981-99-1600-9_7

References: 125 articles.

1. Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of CVPR, 2018.

2. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. In Proceedings of NeurIPS, 2022.

3. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR, 2018.

4. Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of CVPR, 2018.

5. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of ICCV, 2015.

Cited by 3 articles.

1. Heteroassociative Mapping with Self-Organizing Maps for Probabilistic Multi-output Prediction. 2024 International Joint Conference on Neural Networks (IJCNN), 2024-06-30.

2. MMCRec: Towards Multi-modal Generative AI in Conversational Recommendation. Lecture Notes in Computer Science, 2024.

3. New Cloth Unto an Old Garment: SOM for Regeneration Learning. Lecture Notes in Networks and Systems, 2024.