Enhancing Multimodal Understanding With LIUS

Published:2024-01-12 Issue:1 Volume:36 Page:1-17
ISSN:1546-2234
Container-title:Journal of Organizational and End User Computing
language:en

Author:

Song Chunlai¹

Affiliation:

1. Department of Global Business, Kyungil University, South Korea

Abstract

Visual question answering (VQA) is the task of enabling a computer to generate accurate textual answers from a given image and a related question. It integrates computer vision and natural language processing, and requires a model that understands both the image content and the question in order to generate an appropriate answer. However, current limitations in cross-modal understanding mean that models often struggle to capture the complex relationships between images and questions, which leads to inaccurate or ambiguous answers. This research addresses the challenge by combining the strengths of vision and language processing. It introduces the LIUS framework, in which a specialized vision module processes image information and fuses features across multiple scales. The output of this module is then passed to a large language model (LLM), which serves as the reasoning module and generates the answer.

Publisher

IGI Global
