Affiliation:
1. School of Computer Engineering and Science, Shanghai University, No. 99 Shangda Road, Shanghai 200444, China
2. School of Mechatronic Engineering and Automation, Shanghai University, No. 99 Shangda Road, Shanghai 200444, China
3. School of Artificial Intelligence, Shanghai University, No. 99 Shangda Road, Shanghai 200444, China
Abstract
Fusing multiple sensing modalities, most commonly LiDAR and camera, is a prevalent approach to object recognition in autonomous driving systems. Traditional object detection algorithms are limited by the sparsity of LiDAR point clouds, which degrades fusion performance, especially for small and distant targets. In this paper, a Transformer-based multi-task parallel neural network is constructed to perform depth completion and object detection simultaneously. The loss functions are redesigned to suppress environmental noise in depth completion, and a new fusion module is designed to enhance the network's perception of foreground and background. The network leverages correlations between RGB pixels to complete the sparse LiDAR point cloud, resolving the mismatch between sparse LiDAR features and dense pixel features. Features are then extracted from the completed depth map and fused with RGB features, exploiting the depth contrast between foreground and background to improve object detection, particularly for challenging targets. Compared with the baseline network, the proposed method improves the hard-difficulty metrics for cars, pedestrians, and cyclists by 4.78%, 8.93%, and 15.54%, respectively. Experimental results also show that the network runs at 38 fps, demonstrating the efficiency and practicality of the proposed method.
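For orientation, the sketch below shows a minimal PyTorch layout of the two-branch idea the abstract describes: an attention-based branch that densifies the sparse LiDAR depth map using RGB guidance, followed by a gated fusion of depth and RGB features feeding a detection head. All module names, layer sizes, the single self-attention layer, and the channel-gate fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DepthCompletionBranch(nn.Module):
    """Completes a sparse LiDAR depth map with RGB guidance (hypothetical sizes).
    One self-attention layer stands in for the paper's Transformer that exploits
    correlations between RGB pixels to fill in missing depth values."""
    def __init__(self, ch=32):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.depth_enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.attn = nn.MultiheadAttention(embed_dim=2 * ch, num_heads=4,
                                          batch_first=True)
        self.head = nn.Conv2d(2 * ch, 1, 1)

    def forward(self, rgb, sparse_depth):
        f = torch.cat([self.rgb_enc(rgb), self.depth_enc(sparse_depth)], dim=1)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)        # (B, H*W, C) pixel tokens
        tokens, _ = self.attn(tokens, tokens, tokens)
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(f)                          # dense depth map

class FusionDetector(nn.Module):
    """Parallel multi-task net: depth completion plus detection. The sigmoid
    channel gate is a stand-in for the paper's foreground/background-aware
    fusion module."""
    def __init__(self, ch=32, num_anchors=3, num_classes=3):
        super().__init__()
        self.completion = DepthCompletionBranch(ch)
        self.rgb_feat = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.depth_feat = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, 2 * ch, 1), nn.Sigmoid())
        # Per-anchor box (4), objectness (1), and class scores.
        self.det_head = nn.Conv2d(2 * ch, num_anchors * (4 + 1 + num_classes), 1)

    def forward(self, rgb, sparse_depth):
        dense_depth = self.completion(rgb, sparse_depth)
        fused = torch.cat([self.rgb_feat(rgb), self.depth_feat(dense_depth)], dim=1)
        fused = fused * self.gate(fused)             # reweight fg/bg channels
        return dense_depth, self.det_head(fused)

if __name__ == "__main__":
    net = FusionDetector()
    rgb = torch.randn(1, 3, 64, 64)
    sparse = torch.randn(1, 1, 64, 64)  # zeros where LiDAR has no return
    depth, det = net(rgb, sparse)
    print(depth.shape, det.shape)       # (1, 1, 64, 64) and (1, 24, 64, 64)
```

In this reading, both task heads share the completed depth, so a joint loss over the depth and detection outputs would train the two branches in parallel, consistent with the multi-task design the abstract outlines.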
Funder
National Natural Science Foundation of China
Shanghai Science and Technology Committee Natural Science Program