Open-Vocabulary Part-Level Detection and Segmentation for Human–Robot Interaction
Published: 2024-07-21
Container-title: Applied Sciences
Volume: 14
Issue: 14
Page: 6356
ISSN: 2076-3417
Language: en
Author:
Yang Shan 1, Liu Xiongding 1, Wei Wu 1
Affiliation:
1. School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
Abstract
Object detection and segmentation have made great progress in robotic applications. However, to enhance human–robot interaction, intelligent agents require fine-grained, part-level recognition driven by language instructions rather than object-level recognition alone. To improve a robot's interactivity when responding to language instructions, we propose a method for part-level detection and segmentation that exploits vision–language models. In this approach, a Swin Transformer is introduced as the image encoder to extract image features, and the FPN (Feature Pyramid Network) is modified to better process the features produced by the Swin Transformer. Next, an image decoder is proposed to align the image features with text embeddings, enabling human–robot interaction via language. Finally, we verify that the text embeddings are affected by the input command and that different prompt templates also affect classification. The proposed method, validated on two datasets (PartImageNet and Pascal Part), is able to understand and execute part-level missions, and it segments and detects parts more accurately than existing interactive methods.
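The alignment step described above (matching image features of part regions against text embeddings of prompted class names) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoders are stood in for by toy vectors, and the function names (`classify_parts`, `l2_normalize`) are hypothetical; the sketch only shows the cosine-similarity matching that open-vocabulary classification with prompt templates typically relies on.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify_parts(region_features, text_embeddings):
    """Assign each part region the label whose text embedding is most similar.

    region_features: (num_regions, dim) image features for candidate part regions.
    text_embeddings: (num_classes, dim) embeddings of prompted class names,
                     e.g. "a photo of the {part} of a {object}".
    Returns (labels, similarity_matrix).
    """
    sims = l2_normalize(region_features) @ l2_normalize(text_embeddings).T
    return sims.argmax(axis=1), sims

# Toy example: 2 part regions, 3 candidate part labels, 4-d embeddings.
regions = np.array([[1.0, 0.1, 0.0, 0.0],
                    [0.0, 0.9, 0.2, 0.1]])
texts = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "dog head"
                  [0.0, 1.0, 0.0, 0.0],   # e.g. "dog torso"
                  [0.0, 0.0, 1.0, 0.0]])  # e.g. "dog leg"
labels, sims = classify_parts(regions, texts)
print(labels.tolist())  # → [0, 1]
```

Because both sides are unit-normalized, changing the prompt template changes the text embeddings and hence the similarity scores, which is consistent with the abstract's observation that prompt templates affect classification.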
Funder
National Natural Science Foundation of China; Science and Technology Planning Project of Guangdong Province, China