Text-Vision Relationship Alignment for Referring Image Segmentation

Published:2024-02-22 Issue:2 Volume:56
ISSN:1573-773X
Container-title:Neural Processing Letters
language:en
Short-container-title:Neural Process Lett

Author:

Pu Mingxing,Luo Bing,Zhang Chao,Xu Li,Xu Fayou,Kong Mingming

Abstract

Referring image segmentation aims to segment an object in an image based on a referring expression. Its difficulty lies in aligning expression semantics with visual instances. Existing methods based on semantic reasoning are limited by the performance of external syntax parsers and do not explicitly explore the relationships between visual instances. This article proposes an end-to-end method for referring image segmentation that aligns ‘linguistic relationships’ with ‘visual relationships’ and does not rely on an external syntax parser for expression parsing. The Semantic Component Parser (SCP) adaptively and structurally parses the expression, in a learnable manner, into three components: ‘subject’, ‘object’, and ‘linguistic relationship’. The Instances Activation Map Module (IAM) locates multiple visual instances based on the subject and object. In addition, the Relationship Based Visual Localization Module (RBVL) first enables each instance of the image to learn global knowledge, then decodes the visual relationships between these visual instances, and finally aligns the visual relationships with the linguistic relationships to locate the target object more accurately. Experimental results show that the proposed method improves performance by 4–9% over the baseline method on multiple referring image segmentation datasets.
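The three-stage idea in the abstract (soft parsing of the expression, locating candidate instances, then aligning relationships) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `soft_parse` and `align_relationships`, the use of plain dot-product attention, and the use of a feature difference as a stand-in for a "visual relationship" are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_parse(word_vecs, queries):
    """Hypothetical stand-in for a learnable parser: each component query
    attends over all words and returns an attention-weighted word average,
    so no external syntax parser is involved."""
    dim = len(word_vecs[0])
    components = {}
    for name, query in queries.items():
        weights = softmax([dot(query, w) for w in word_vecs])
        components[name] = [
            sum(wt * w[d] for wt, w in zip(weights, word_vecs))
            for d in range(dim)
        ]
    return components


def align_relationships(instance_vecs, subject_vec, rel_vec):
    """Score each instance as the target (requires at least two instances).
    Assumption: the visual relationship between instances i and j is
    approximated by the difference of their feature vectors."""
    n = len(instance_vecs)
    scores = []
    for i in range(n):
        best_rel = max(
            dot([a - b for a, b in zip(instance_vecs[i], instance_vecs[j])], rel_vec)
            for j in range(n)
            if j != i
        )
        scores.append(dot(instance_vecs[i], subject_vec) + best_rel)
    return softmax(scores)


# Toy data: two "words", three candidate instances, 2-d features.
components = soft_parse(
    word_vecs=[[1.0, 0.0], [0.0, 1.0]],
    queries={"subject": [5.0, 0.0], "object": [0.0, 5.0]},
)
probs = align_relationships(
    instance_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    subject_vec=[1.0, 0.0],
    rel_vec=[1.0, -1.0],
)
target_index = probs.index(max(probs))
```

In a trained model the queries, instance features, and relationship decoder would be learned end to end; here they are fixed toy vectors so that the alignment step can be followed by hand.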

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11063-024-11487-2.pdf
