PrimitiveNet: decomposing the global constraints for referring segmentation

Published:2024-06-27 Issue:1 Volume:2
ISSN:2731-9008
Container-title:Visual Intelligence
language:en
Short-container-title:Vis. Intell.

Author:

Liu Chang, Jiang Xudong, Ding Henghui

Abstract

In referring segmentation, modeling the complicated constraints in the multimodal information is one of the most challenging problems. As the information in a given language expression and image becomes increasingly abundant, most current one-stage methods, which directly output the segmentation mask, have difficulty understanding the complicated relationships between the image and the expression. In this work, we propose PrimitiveNet, which decomposes the difficult global constraints into a set of simple primitives. Each primitive produces a primitive mask that represents only a simple semantic meaning, e.g., all instances from the same category. The output segmentation mask is then computed by selectively combining these primitive masks according to the language expression. Furthermore, we propose a cross-primitive attention (CPA) module and a language-primitive attention (LPA) module to exchange information among the primitives and between the primitives and the language expression, respectively. The proposed CPA and LPA help the network find appropriate weights for the primitive masks, so as to recover the target object. Extensive experiments demonstrate the effectiveness of our design and verify that the proposed network outperforms current state-of-the-art referring segmentation methods on three RefCOCO datasets.
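
The combination step described in the abstract can be illustrated with a toy sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration only: the function names (`primitive_weights`, `combine_masks`), the use of a dot product followed by a softmax as a stand-in for the attention modules, the weighted sum of soft masks, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def primitive_weights(language_vec, primitive_vecs):
    """Score each primitive against the language embedding with a dot
    product, then normalise the scores into weights.

    Hypothetical stand-in for language-conditioned weighting; this is
    NOT the paper's CPA or LPA module."""
    scores = [
        sum(l * p for l, p in zip(language_vec, vec))
        for vec in primitive_vecs
    ]
    return softmax(scores)


def combine_masks(primitive_masks, weights, threshold=0.5):
    """Weighted sum of per-primitive soft masks, then binarise.

    The weighted sum and the threshold are assumptions for illustration."""
    height = len(primitive_masks[0])
    width = len(primitive_masks[0][0])
    combined = [[0.0] * width for _ in range(height)]
    for mask, weight in zip(primitive_masks, weights):
        for y in range(height):
            for x in range(width):
                combined[y][x] += weight * mask[y][x]
    return [[1 if value > threshold else 0 for value in row] for row in combined]


# Toy inputs: two primitives on a 2x2 image.
language_vec = [4.0, 0.0]                   # made-up expression embedding
primitive_vecs = [[1.0, 0.0], [0.0, 1.0]]   # made-up primitive embeddings
primitive_masks = [
    [[1.0, 0.0], [0.0, 0.0]],               # primitive 0 covers the top-left pixel
    [[0.0, 1.0], [1.0, 1.0]],               # primitive 1 covers the other pixels
]

weights = primitive_weights(language_vec, primitive_vecs)
final_mask = combine_masks(primitive_masks, weights)
```

In this toy setting the expression embedding aligns with primitive 0, so that primitive receives almost all of the weight and the combined mask keeps only its region.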

Funder

NTU Presidential Postdoctoral Fellowship

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s44267-024-00049-8.pdf
