Affiliation:
1. Department of Mechanical and Aerospace Engineering, Western Michigan University, 1903 West Michigan Ave, Kalamazoo, MI 49008, USA
Abstract
Current state-of-the-art (SOTA) LiDAR-only detectors perform well on 3D object detection tasks, but point cloud data are typically sparse and lack semantic information. Detailed semantic information obtained from camera images can be fused with existing LiDAR-based detectors to create a robust 3D detection pipeline. With two different data types, a major challenge in developing multi-modal sensor fusion networks is achieving effective data fusion while managing computational resources. With separate 2D and 3D feature extraction backbones, feature fusion becomes more challenging because the two modalities generate different gradients, leading to gradient conflicts and suboptimal convergence during network optimization. To this end, we propose a 3D object detection method, Attention-Enabled Point Fusion (AEPF). AEPF takes images and voxelized point cloud data as inputs and estimates 3D bounding boxes of object locations as outputs. An attention mechanism is introduced into an existing feature fusion strategy to improve 3D detection accuracy, and two variants are proposed. These two variants, AEPF-Small and AEPF-Large, address different needs. AEPF-Small, with a lightweight attention module and fewer parameters, offers fast inference. AEPF-Large, with a more complex attention module and more parameters, provides higher accuracy than baseline models. Experimental results on the KITTI validation set show that AEPF-Small maintains SOTA 3D detection accuracy while running inference at higher speeds. AEPF-Large achieves mean average precision scores of 91.13, 79.06, and 76.15 on the car class’s easy, moderate, and hard targets, respectively, on the KITTI validation set. Results from ablation experiments are also presented to support the choice of model architecture.
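To illustrate the kind of attention-gated fusion the abstract describes, the sketch below blends per-location LiDAR and image features with a learned sigmoid gate. This is a minimal, hypothetical example; the function name, feature shapes, and the random projection matrix are illustrative assumptions, not the paper's actual AEPF architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(lidar_feat, image_feat, w):
    """Attention-gated fusion of two modalities (illustrative only).

    lidar_feat, image_feat: (N, C) per-voxel features from each backbone.
    w: (2C, C) projection producing a per-channel gate; here it is random,
       in a real network it would be learned.
    """
    concat = np.concatenate([lidar_feat, image_feat], axis=1)  # (N, 2C)
    gate = sigmoid(concat @ w)                                 # (N, C), values in (0, 1)
    # Convex combination: the gate decides how much each modality contributes.
    return gate * lidar_feat + (1.0 - gate) * image_feat       # fused (N, C)

N, C = 4, 8
li = rng.standard_normal((N, C))   # stand-in LiDAR voxel features
im = rng.standard_normal((N, C))   # stand-in image features, already sampled per voxel
fused = attention_fuse(li, im, rng.standard_normal((2 * C, C)))
print(fused.shape)  # (4, 8)
```

Because the gate is a sigmoid, each fused channel lies between the corresponding LiDAR and image values, so neither modality's features can be amplified beyond their original range.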
Funder
US DOE’s Office of Energy Efficiency and Renewable Energy