Affiliation:
1. Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Artificial Intelligence Institute, Faculty of Information Technology, Beijing University of Technology, Beijing, China
2. China Electronics Technology Group Taiji Co., Ltd., Beijing, China
Abstract
Referring image segmentation identifies object masks in images under the guidance of input natural language expressions. Many remarkable cross‐modal decoders have been devoted to this task, but these models face two key challenges. First, they usually fail to extract fine‐grained boundary and gradient information from images. Second, they usually fail to explore language associations among image pixels. In this work, a Multi‐scale Gradient‐balanced Central Difference Convolution (MG‐CDC) and a Graph convolutional network‐based Language and Image Fusion (GLIF) module for the cross‐modal encoder, together called Graph‐RefSeg, are designed. Specifically, in the shallow layers of the encoder, MG‐CDC captures comprehensive fine‐grained image features; it enhances the perception of target boundaries and provides effective guidance for the deeper encoding layers. In each encoder layer, GLIF performs cross‐modal fusion, exploring the correlation between every pixel and its corresponding language vectors through a graph neural network. Since the encoder achieves robust cross‐modal alignment and context mining, a lightweight decoder suffices for segmentation prediction. Extensive experiments show that the proposed Graph‐RefSeg outperforms state‐of‐the‐art methods on three public datasets. Code and models will be made publicly available at https://github.com/ZYQ111/Graph_refseg.
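The abstract's claim about capturing gradient and boundary information rests on the central difference convolution operator that MG‐CDC builds on. As a rough illustration only (the function name, single‐channel setting, and default `theta` are assumptions, not the paper's implementation), a CDC blends a vanilla convolution with a central‐difference term weighted by a factor theta:

```python
import numpy as np

def cd_conv2d(x, w, theta=0.7):
    """Single-channel central difference convolution (illustrative sketch):
        y(p0) = sum_n w(pn) * x(p0 + pn)  -  theta * x(p0) * sum_n w(pn)
    theta = 0 recovers a vanilla convolution; theta = 1 responds only to
    local intensity differences, i.e. gradient-like boundary cues."""
    kh, kw = w.shape
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1   # valid (no-padding) output size
    out = np.zeros((oh, ow))
    wsum = w.sum()
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + kh, j:j + kw]
            vanilla = (patch * w).sum()            # standard convolution term
            center = x[i + kh // 2, j + kw // 2]   # central pixel of the patch
            out[i, j] = vanilla - theta * wsum * center
    return out
```

On a constant image the difference term cancels the vanilla term exactly when theta = 1, so the response is zero everywhere; this is the sense in which CDC emphasizes edges over flat regions.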
Funder
National Key Research and Development Program of China
National Natural Science Foundation of China
Publisher
Institution of Engineering and Technology (IET)
Cited by
1 article.