Transformer-Based Visual Grounding with Cross-Modality Interaction

Published: 2023-05-30 Issue: 6 Volume: 19 Page: 1-19
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Li Kun¹, Li Jiaxiu¹, Guo Dan¹, Yang Xun², Wang Meng¹

Affiliation:

1. Hefei University of Technology, China

2. University of Science and Technology of China, China

Abstract

This article tackles the challenging yet important task of Visual Grounding (VG), which aims to localize the visual region in a given image that a natural language query refers to. Existing efforts on the VG task fall into two groups: (1) two-stage methods first extract region proposals and then rank them according to their similarity to the referring expression, which usually leads to suboptimal results because performance is bounded by the quality of the region proposals; (2) one-stage methods usually predict all the possible coordinates of the target region online by leveraging modern object detection architectures, but they pay little attention to cross-modality correlations and have limited generalization ability. To better address the task, we present an effective transformer-based end-to-end visual grounding approach, which focuses on capturing the cross-modality correlations between the referring expression and visual regions in order to reason accurately about the location of the target region. Specifically, our model consists of a feature encoder, a cross-modality interactor, and a modality-agnostic decoder. The feature encoder captures the intra-modality correlations, modeling the linguistic context in the query and the spatial dependency in the image, respectively. The cross-modality interactor, which plays a key role in our model, endows it with the capability of highlighting the localization-relevant visual and textual cues through mutual verification of vision and language. The decoder learns a consolidated token representation enriched by multi-modal contexts and then directly predicts the box coordinates. Extensive experiments on five public benchmark datasets, with quantitative and qualitative analysis, demonstrate the effectiveness and rationale of our proposed method.
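As a rough illustration of the "mutual verification of vision and language" mentioned in the abstract, the following is a minimal sketch of bidirectional scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the function names `softmax` and `cross_attend`, the toy feature values, and the absence of learned projection matrices are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention from one modality to the other.

    Each query vector is replaced by a weighted average of the other
    modality's vectors. Hypothetical simplification: no learned
    query/key/value projections and a single attention head.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


# Toy 2-D features (made-up values, for illustration only).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_regions = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

# Language attends to vision, and vision attends to language.
text_ctx = cross_attend(text_tokens, visual_regions)
visual_ctx = cross_attend(visual_regions, text_tokens)
```

In this toy setup, the first text token puts most of its attention weight on the first visual region, because their dot product is the largest. A full model would add learned projections, multiple heads, and a decoder that regresses box coordinates from the fused representation.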

Funder

National Natural Science Foundation of China

Major Project of Anhui Province

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3587251

Cited by 10 articles.

1. Language conditioned multi-scale visual attention networks for visual grounding;Image and Vision Computing;2024-10

2. EPK-CLIP: External and Priori Knowledge CLIP for action recognition;Expert Systems with Applications;2024-10

3. Low-light wheat image enhancement using an explicit inter-channel sparse transformer;Computers and Electronics in Agriculture;2024-09

4. SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-16

5. Artificial intelligence in ischemic stroke images: current applications and future directions;Frontiers in Neurology;2024-07-10