Multi-task learning and joint refinement between camera localization and object detection

Published: 2024-02-08
ISSN: 2096-0433
Container-title: Computational Visual Media
Language: en
Short-container-title: Comp. Visual Media

Author:

Wang Junyi, Qi Yue

Abstract

Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios where some detected objects have fixed positions, the two techniques work closely together. However, few researchers consider these two tasks simultaneously, because of a lack of datasets and the little attention paid to such environments. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a medium indoor scene of an aviation exhibition hall through a semi-automatic process. The dataset provides localization and detection information, and is publicly available at https://drive.google.com/drive/folders/1U28zkuN4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. Targeting this dataset, we have designed a multi-task network, JLDNet, based on YOLO v3, that outputs a target point cloud and object bounding boxes. For dynamic environments, the detection branch also promotes the perception of dynamics. JLDNet includes image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression. Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To test JLDNet and compare it to other methods, we have conducted experiments on 7 static scenes, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. Our results show state-of-the-art accuracy for both tasks, and the benefit of jointly working on both tasks is demonstrated.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s41095-022-0319-z.pdf

References (71 articles)

1. Bao, W.; Wang, W.; Xu, Y. H.; Guo, Y. L.; Hong, S. Y.; Zhang, X. H. InStereo2K: A large real dataset for stereo matching in indoor scenes. Science China Information Sciences Vol. 63, No. 11, 212101, 2020.

2. Yan, F. H.; Li, Z. X.; Zhou, Z. Robust and efficient edge-based visual odometry. Computational Visual Media Vol. 8, No. 3, 467–481, 2022.

3. Huang, J. H.; Yang, S.; Zhao, Z. S.; Lai, Y. K.; Hu, S. M. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. Computational Visual Media Vol. 7, No. 1, 87–101, 2021.

4. Wang, C.; Guo, X. H. Feature-based RGB-D camera pose optimization for real-time 3D reconstruction. Computational Visual Media Vol. 3, No. 2, 95–106, 2017.

5. Nakajima, Y.; Saito, H. Robust camera pose estimation by viewpoint classification using deep learning. Computational Visual Media Vol. 3, No. 2, 189–198, 2017.