Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding

Published: 2023-08-24 Issue: 1 Volume: 20 Page: 1-23
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Wang Jia¹, Shuai Hong-Han¹, Li Yung-Hui², Cheng Wen-Huang³

Affiliation:

1. National Yang Ming Chiao Tung University, Taiwan

2. Hon Hai Research Institute, Taiwan

3. National Taiwan University, Taiwan

Abstract

Visual grounding is an essential task in understanding the semantic relationship between the given text description and the target object in an image. Due to the innate complexity of language and the rich semantic context of the image, it is still a challenging problem to infer the underlying relationship and to perform reasoning between the objects in an image and the given expression. Although existing visual grounding methods have achieved promising progress, cross-modal mapping across different domains for the task is still not well handled, especially when the expressions are complex and long. To address the issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), which enables us to apply deeper graph convolution layers with the assistance of residual connections between them. This allows us to better handle long and complex expressions than other graph-based methods. Furthermore, we perform a Language-guided Data Augmentation (LGDA), which is based on copy-paste operations on pairs of source and target images to increase the diversity of training data while maintaining the relationship between the objects in the image and the expression. With extensive experiments on three visual grounding benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, LRGAT-VG with LGDA achieves competitive performance with other state-of-the-art graph network-based referring expression approaches and demonstrates its effectiveness.
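The abstract's central idea, stacking graph attention layers with residual connections while the expression guides the attention, can be illustrated with a small sketch. This is not the authors' LRGAT-VG architecture: the function names, the tanh language gate, the scaled dot-product scoring, and the ReLU on the aggregated message are all assumptions chosen to keep the example short and dependency-free. It assumes square (D x D) projections and an adjacency matrix with self-loops, so every node has at least one neighbour.

```python
import math


def project(rows, weight):
    """Multiply each row vector in `rows` by the matrix `weight`."""
    return [
        [sum(r[k] * weight[k][j] for k in range(len(r))) for j in range(len(weight[0]))]
        for r in rows
    ]


def softmax(values):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention_layer(nodes, adj, lang, w_node, w_lang):
    """One hypothetical language-guided graph attention layer with a skip connection.

    nodes: N object feature vectors of size D.
    adj:   N x N 0/1 adjacency matrix (assumed to include self-loops).
    lang:  expression embedding of size D.
    w_node, w_lang: D x D projection matrices (illustrative parameters).
    """
    dim = len(lang)
    h = project(nodes, w_node)
    # Assumed form of language guidance: a per-dimension gate on the scores.
    gate = [math.tanh(g) for g in project([lang], w_lang)[0]]
    out = []
    for i in range(len(h)):
        neighbours = [j for j in range(len(h)) if adj[i][j]]
        scores = [
            sum(h[i][d] * gate[d] * h[j][d] for d in range(dim)) / math.sqrt(dim)
            for j in neighbours
        ]
        attn = softmax(scores)
        # Attention-weighted sum of neighbour features, followed by ReLU.
        message = [
            max(0.0, sum(a * h[j][d] for a, j in zip(attn, neighbours)))
            for d in range(dim)
        ]
        # Residual connection: add the message to the layer input.
        out.append([nodes[i][d] + message[d] for d in range(dim)])
    return out


def stack_layers(nodes, adj, lang, weights):
    """Apply several residual attention layers in sequence."""
    for w_node, w_lang in weights:
        nodes = residual_attention_layer(nodes, adj, lang, w_node, w_lang)
    return nodes
```

Because each layer adds its message to its input rather than replacing it, the original object features remain present in the output however many layers are stacked. That is the general motivation for residual connections in deeper graph networks.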

Funder

National Science and Technology Council of Taiwan

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3604557
