Affiliations:
1. Department of Computer Science, Baylor University, Waco, TX 76706, USA
2. Department of Computer Science, Texas A&M University, College Station, TX 77843, USA
Abstract
Images and text have become essential components of the multimodal machine learning (MMML) framework: data in both forms are abundantly available, and technological breakthroughs now make it practical to bring these disparate modalities together. Text adds semantic richness and narrative context to images, while images capture visual subtleties and emotions that text alone cannot convey. Together, the two media support understanding beyond what either could achieve on its own, enabling transformative applications. This paper investigates feature extraction from text and image data using pre-trained models in MMML and the advances built on it. It offers a thorough analysis of fusion architectures, outlining how text and image data are integrated and evaluating their overall advantages and effects. Furthermore, it draws attention to the shortcomings and difficulties that MMML currently faces and identifies areas that need further research and development. To accomplish this, we gathered 341 research articles from five digital library databases; after a rigorous assessment procedure, 88 research papers remained, enabling a detailed evaluation of MMML. Our findings demonstrate that pre-trained models, such as BERT for text and ResNet for images, are predominantly employed for feature extraction due to their robust performance across diverse applications. Fusion techniques, ranging from simple concatenation to advanced attention mechanisms, are extensively adopted to enhance the representation of multimodal data. Despite these advancements, MMML models face significant challenges, including handling noisy data, optimizing dataset size, and ensuring robustness against adversarial attacks. Our findings highlight the need for further research to address these challenges, particularly in developing methods that improve the robustness of MMML models.
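As a minimal sketch of the feature-extraction pipeline the abstract describes, the snippet below pairs a pre-trained BERT encoder (via Hugging Face Transformers) with a pre-trained ResNet-50 (via torchvision). The specific checkpoints (bert-base-uncased, ResNet-50), the [CLS]-token pooling for text, and the removal of ResNet's classification head are illustrative assumptions, not the exact configurations used in the surveyed papers.

```python
# Sketch: unimodal feature extraction with pre-trained BERT and ResNet.
# Checkpoints and pooling choices are assumptions for illustration.
import torch
from transformers import BertTokenizer, BertModel
from torchvision import models, transforms
from PIL import Image

# Text branch: use BERT's [CLS] token embedding as a 768-d text feature.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_features(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # shape (1, 768)

# Image branch: ResNet-50 with its classification head replaced by an
# identity, so the network returns the 2048-d global-pooled feature.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_image_features(image: Image.Image) -> torch.Tensor:
    batch = preprocess(image).unsqueeze(0)  # shape (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(batch)  # shape (1, 2048)
```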
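The fusion spectrum the abstract mentions, from simple concatenation to attention mechanisms, can likewise be sketched in a few lines. Both modules below are illustrative assumptions rather than any surveyed paper's architecture: ConcatFusion concatenates the unimodal features and projects them, while AttentionFusion uses PyTorch's MultiheadAttention so the text feature attends to the image feature; the dimensions match the BERT (768-d) and ResNet (2048-d) extractors above.

```python
# Sketch: two ends of the fusion spectrum for text + image features.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Simple fusion: concatenate unimodal features, then project."""
    def __init__(self, text_dim=768, img_dim=2048, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim + img_dim, out_dim)

    def forward(self, text_feat, img_feat):
        return self.proj(torch.cat([text_feat, img_feat], dim=-1))

class AttentionFusion(nn.Module):
    """Attention-based fusion: text queries attend to image features."""
    def __init__(self, text_dim=768, img_dim=2048, dim=512, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.img_proj = nn.Linear(img_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feat, img_feat):
        # Treat each modality as a length-1 sequence for illustration.
        q = self.text_proj(text_feat).unsqueeze(1)   # (B, 1, dim)
        kv = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, dim)
        fused, _ = self.attn(q, kv, kv)
        return fused.squeeze(1)                      # (B, dim)
```

Concatenation is cheap and preserves both feature vectors unchanged, whereas attention lets the model weight visual evidence conditioned on the text, which is why attention-based fusion is adopted where richer multimodal representations are needed.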
Funder
National Science Foundation