Open-Vocabulary Part-Level Detection and Segmentation for Human–Robot Interaction
Published: 2024-07-21
Container-title: Applied Sciences
Volume: 14
Issue: 14
Page: 6356
ISSN: 2076-3417
Language: en
Author:
Yang Shan 1, Liu Xiongding 1, Wei Wu 1
Affiliation:
1. School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
Abstract
Object detection and segmentation have made great progress in robotic applications. However, to enhance human–robot interaction, intelligent agents require fine-grained, part-level recognition driven by language instructions rather than object-level recognition alone. To improve a robot's interactivity when responding to language instructions, we propose a method for part-level detection and segmentation that exploits vision–language models. In this approach, a Swin Transformer is introduced as the image encoder to extract image features, and the FPN (Feature Pyramid Network) is modified to better process the features produced by the Swin Transformer. Next, an image decoder is proposed to align the image features with text embeddings, enabling human–robot interaction via language. Finally, we verify that the text embeddings are affected by the input command and that different prompt templates also affect classification. The proposed method, validated on two datasets (PartImageNet and Pascal Part), is able to understand and execute part-level missions, and it segments and detects parts more accurately than existing interactive methods.
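The alignment step described above (matching image features of part regions against text embeddings of prompted class names) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoders are stood in for by toy vectors, and the function names (`classify_parts`, `l2_normalize`) are hypothetical; the sketch only shows the cosine-similarity matching that open-vocabulary classification with prompt templates typically relies on.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify_parts(region_features, text_embeddings):
    """Assign each part region the label whose text embedding is most similar.

    region_features: (num_regions, dim) image features for candidate part regions.
    text_embeddings: (num_classes, dim) embeddings of prompted class names,
                     e.g. "a photo of the {part} of a {object}".
    Returns (labels, similarity_matrix).
    """
    sims = l2_normalize(region_features) @ l2_normalize(text_embeddings).T
    return sims.argmax(axis=1), sims

# Toy example: 2 part regions, 3 candidate part labels, 4-d embeddings.
regions = np.array([[1.0, 0.1, 0.0, 0.0],
                    [0.0, 0.9, 0.2, 0.1]])
texts = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "dog head"
                  [0.0, 1.0, 0.0, 0.0],   # e.g. "dog torso"
                  [0.0, 0.0, 1.0, 0.0]])  # e.g. "dog leg"
labels, sims = classify_parts(regions, texts)
print(labels.tolist())  # → [0, 1]
```

Because both sides are unit-normalized, changing the prompt template changes the text embeddings and hence the similarity scores, which is consistent with the abstract's observation that prompt templates affect classification.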
Funder
National Natural Science Foundation of China; Science and Technology Planning Project of Guangdong Province, China