Abstract
Existing visual question answering methods typically concentrate only on the visual objects in an image while ignoring its key textual content, which limits the depth and accuracy of image comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by the over-fitting risk of existing self-attention-based models, and propose a scene-text visual question answering method, INT2-VQA, that fuses knowledge representations through inter-modality and intra-modality collaborations. Specifically, we model two kinds of complementary prior knowledge: the positional collaboration between visual objects and textual objects across modalities, and the contextual semantic collaboration among textual word tokens within a modality. On this basis, a universal knowledge-reinforced attention module is designed to encode both relations in a unified representation. Extensive ablation studies, comparison experiments, and visualization analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
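The abstract describes injecting prior knowledge into attention. As a minimal illustrative sketch (not the authors' implementation), the prior relations can be encoded as a pairwise bias matrix added to the attention logits before the softmax; the function name, shapes, and the synthetic prior below are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_reinforced_attention(Q, K, V, prior_bias):
    """Scaled dot-product attention whose logits are shifted by a prior
    relation matrix (e.g. positional relations between visual and textual
    objects, or word-context affinities).
    Q, K, V: (n, d) feature matrices; prior_bias: (n, n) pairwise prior."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + prior_bias  # inject the prior as an additive bias
    return softmax(logits, axis=-1) @ V

# Toy usage: 4 objects with 8-dim features and a synthetic pairwise prior.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
prior = rng.standard_normal((4, 4))
out = knowledge_reinforced_attention(feats, feats, feats, prior)
print(out.shape)  # (4, 8)
```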
Publisher
Public Library of Science (PLoS)
Cited by
1 article.