Graph Pooling Inference Network for Text-based VQA

Published:2024-01-11 Issue:4 Volume:20 Page:1-21
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Zhou Sheng¹, Guo Dan¹, Yang Xun², Dong Jianfeng³, Wang Meng¹

Affiliation:

1. Hefei University of Technology, China

2. University of Science and Technology of China, China

3. Zhejiang Gongshang University, China

Abstract

Effectively leveraging objects and optical character recognition (OCR) tokens to reason out pivotal scene text is critical for the challenging Text-based Visual Question Answering (TextVQA) task. Graph-based models can effectively capture the semantic relationships among visual entities (objects and tokens) and report remarkable performance in TextVQA. However, previous efforts usually leverage all visual entities and ignore the negative effect of superfluous entities. This article presents a Graph Pooling Inference Network (GPIN), an evolutionary graph learning method that purifies the visual entities and captures the core semantics. It is observed that densely distributed redundant objects and clusters of semantically dependent OCR tokens usually co-exist in an image. Motivated by this, GPIN adopts an adaptive node dropping strategy to dynamically downscale semantically close nodes for graph evolution and update. To deepen the comprehension of scene text, GPIN uses a dual-path hierarchical graph architecture that progressively aggregates the semantics of the evolved object graph and the evolved token graph into a graph vector, which serves as a visual cue to facilitate answer reasoning. It can effectively eliminate object redundancy and enhance the association of semantically continuous tokens. Experiments conducted on the TextVQA and ST-VQA datasets show that GPIN achieves promising performance compared with state-of-the-art methods.
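
As a rough illustration of the node dropping and graph readout ideas mentioned in the abstract, the following minimal Python sketch drops nodes that are nearly identical to a higher-scoring node and then mean-pools the survivors into a single graph vector. It is not the GPIN implementation: the greedy cosine-similarity rule, the `sim_threshold` parameter, the hand-picked `scores`, and all function names are assumptions made for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_redundant_nodes(features, scores, sim_threshold=0.9):
    """Greedy node dropping (hypothetical rule, not the paper's): visit nodes from
    highest to lowest score and keep a node only if it is not too similar to
    any node already kept."""
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(features[i], features[j]) < sim_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


def pool_graph(features, adjacency, kept):
    """Restrict the graph to the kept nodes and mean-pool them into one vector."""
    sub_features = [features[i] for i in kept]
    sub_adjacency = [[adjacency[i][j] for j in kept] for i in kept]
    dim = len(features[0])
    graph_vector = [sum(f[d] for f in sub_features) / len(sub_features)
                    for d in range(dim)]
    return sub_features, sub_adjacency, graph_vector


# Toy graph: nodes 0 and 1 are near-duplicates, node 2 is distinct.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
scores = [0.8, 0.6, 0.7]  # hypothetical importance scores, chosen by hand
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

kept = drop_redundant_nodes(features, scores)
sub_features, sub_adjacency, graph_vector = pool_graph(features, adjacency, kept)
print(kept, graph_vector)
```

In this toy graph, node 1 is dropped because it is almost parallel to the higher-scoring node 0, while the dissimilar node 2 survives. A learned model would typically compute the scores and the pooling ratio from data rather than fix them by hand.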

Funder

National Key Research and Development Program of China

National Natural Science Foundation of China

Major Project of Anhui Province

University Synergy Innovation Program of Anhui Province

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3634918

References: 64 articles.

1. Jon Almazán et al. 2014. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell.

2. Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R. Manmatha. 2022. LaTr: Layout-aware transformer for scene-text VQA. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’22). 16548–16558.

3. Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marcal Rusinol, C. V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. 2019. Scene text visual question answering. In Proceedings of the International Conference on Computer Vision (ICCV’19). 4290–4300.

4. Piotr Bojanowski et al. 2017. Enriching word vectors with subword information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL’17).

5. Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. 2018. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the ACM Knowledge Discovery and Data Mining (SIGKDD’18). 71–79.

Cited by 3 articles.

1. Oscar: Omni-scale robust contrastive learning for Text-VQA. Expert Systems with Applications, 2024-12.

2. ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese. Multimedia Systems, 2024-07-06.

3. Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition. International Journal of Computer Vision, 2024-05-27.