PrimitivePose: Generic Model and Representation for 3D Bounding Box Prediction of Unseen Objects

Published:2023-08-09 Issue:03 Volume:17 Page:387-410
ISSN:1793-351X
Container-title:International Journal of Semantic Computing
language:en
Short-container-title:Int. J. Semantic Computing

Author:

Kriegler Andreas¹²,Beleznai Csaba¹,Gelautz Margrit²,Murschitz Markus³,Göbel Kai¹

Affiliation:

1. Vision, Automation and Control, AIT Austrian Institute of Technology, Giefinggasse 6, Vienna 1210, Austria

2. Visual Computing and Human-Centered Technology, TU Wien, Favoritenstraße 9-11, Vienna 1040, Austria

3. Vision, Automation and Control, AIT Austrian Institute of Technology, Giefinggasse 4, Vienna 1210, Austria

Abstract

A considerable amount of research is concerned with the challenging task of estimating three-dimensional (3D) pose and size for multi-object indoor scene configurations. Many existing models rely on a priori known object models, such as 3D CAD models, and are therefore limited to a predefined set of object categories. This closed-set constraint limits the range of applications for robots interacting in dynamic environments where previously unseen objects may appear. This paper addresses the problem with a highly generic 3D bounding box detection method that relies entirely on geometric cues obtained from depth data. While synthetic data, e.g. synthetic depth maps, is commonly generated for this task, the well-known synth-to-real gap often emerges and prevents models trained solely on synthetic data from transferring to the real world. To ameliorate this problem, we use stereo depth computation on synthetic data to obtain pseudo-realistic disparity maps. We then propose an intermediate representation, namely disparity-scaled surface normal (SN) images, which encode geometry while preserving depth and scale information, unlike the commonly used standard SNs. In a series of experiments, we demonstrate the usefulness of our approach by detecting everyday objects on a captured data set of tabletop scenes, and compare it to the popular PoseCNN model. We quantitatively show that standard SNs are less adequate for challenging 3D detection tasks by comparing predictions from the model when trained on disparity alone, on SNs, and on disparity-scaled SNs. Additionally, in an ablation study we investigate the minimal number of training samples required for such a learning task. Lastly, we make the tool used for 3D object annotation publicly available at https://preview.tinyurl.com/3ycn8v5k, and a video showcasing our results can be found at https://preview.tinyurl.com/dzdzabek.
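The abstract does not spell out how the disparity-scaled SN images are computed, so the following is only an illustrative sketch of one plausible reading: unit surface normals are estimated from a disparity map and then multiplied by the normalized disparity, so that the direction of each pixel vector encodes geometry while its length retains depth/scale information. The function name `disparity_scaled_normals`, the parameters `focal_px` and `baseline_m`, the normalization by the maximum disparity, and the pixel-space gradient approximation are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def disparity_scaled_normals(disparity, focal_px=500.0, baseline_m=0.1, eps=1e-6):
    """Hypothetical sketch: unit surface normals scaled by normalized disparity.

    disparity : 2D array of stereo disparities in pixels (<= eps means invalid).
    focal_px, baseline_m : assumed camera parameters, not from the paper.
    Returns an (H, W, 3) array whose per-pixel length equals disparity / max.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    valid = disparity > eps

    # Depth from standard stereo geometry: Z = f * B / d.
    depth = np.zeros_like(disparity)
    depth[valid] = focal_px * baseline_m / disparity[valid]

    # Depth gradients along image rows (v) and columns (u).
    dz_dv, dz_du = np.gradient(depth)

    # Simplified normal estimate (-dZ/du, -dZ/dv, 1), normalized to unit length.
    # Assumption: pixel spacing is treated as uniform instead of back-projecting
    # every pixel to metric 3D coordinates.
    normals = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)

    # Scale each unit normal by normalized disparity; invalid pixels become zero.
    scale = np.where(valid, disparity / max(disparity.max(), eps), 0.0)
    return normals * scale[..., None]
```

In this sketch a standard SN image would be the `normals` array alone, which is identical for a near and a far fronto-parallel plane. Multiplying by the normalized disparity makes the two planes distinguishable, which is the property the abstract attributes to the proposed representation.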

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence,Computer Networks and Communications,Computer Science Applications,Linguistics and Language,Information Systems,Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S1793351X23620027
