DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles
Published: 2024-04-30
Volume: 16
Issue: 9
Page: 1621
ISSN: 2072-4292
Container-title: Remote Sensing
Language: en
Author:
Zhu Yuan 1, Xu Ruidong 1, Tao Chongben 2, An Hao 1, Wang Huaide 1, Sun Zhipeng 3, Lu Ke 1
Affiliation:
1. School of Automotive Studies, Tongji University, Shanghai 201800, China
2. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
3. Nanchang Automotive Institute of Intelligence and New Energy, Tongji University, Nanchang 330010, China
Abstract
3D object detection in complex weather conditions and road environments remains a significant challenge, and existing algorithms based on single-frame point cloud data struggle to achieve desirable results. These methods typically focus on spatial relationships within a single frame while overlooking the semantic correlations and spatiotemporal continuity between consecutive frames, which leads to discontinuities and abrupt changes in the detection results. To address this issue, this paper proposes a multi-frame 3D object detection algorithm based on a deformable spatiotemporal Transformer. Specifically, a deformable cross-scale Transformer module is devised, incorporating a multi-scale offset mechanism that non-uniformly samples features at different scales and thereby enhances the spatial information aggregation capability of the output features. To address feature misalignment during multi-frame feature fusion, a deformable cross-frame Transformer module is proposed; it learns independent offset parameters for the features of each frame, enabling the model to adaptively associate dynamic features across multiple frames and improve its use of temporal information. A proposal-aware sampling algorithm is also introduced to significantly increase foreground point recall and further improve the efficiency of feature extraction. Finally, the resulting multi-scale and multi-frame voxel features are fed into a mixed voxel set extraction module, which adaptively learns fusion weights so that the model obtains mixed features containing both spatial and temporal information. The effectiveness of the proposed algorithm is validated on the KITTI and nuScenes benchmarks and on a self-collected urban dataset, where it achieves an average precision improvement of 2.1% over the latest multi-frame algorithms.
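The core idea behind both deformable modules is offset-based sampling: each query predicts a small set of non-uniform sampling locations around a reference point, gathers features there by bilinear interpolation, and fuses them with learned attention weights. The following is a minimal PyTorch sketch of that mechanism in the spirit of the paper's deformable cross-scale module; module, parameter, and shape choices here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Minimal offset-based deformable sampling over a BEV feature map.

    Each query predicts K sampling offsets around its reference point,
    gathers features at those non-uniform locations with bilinear
    interpolation, and fuses them with learned attention weights.
    Names and shapes are illustrative, not from the paper.
    """

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)  # (dx, dy) per point
        self.weight_proj = nn.Linear(dim, num_points)      # weight per point
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, N, C)   query features
        # ref_points: (B, N, 2)   reference points in [0, 1] x [0, 1]
        # feat_map:   (B, C, H, W) one scale (or one frame) of BEV features
        B, N, C = queries.shape
        offsets = self.offset_proj(queries).view(B, N, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)   # (B, N, K)

        # Sampling locations, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + 0.05 * offsets.tanh()).clamp(0, 1)
        grid = loc * 2.0 - 1.0                                # (B, N, K, 2)

        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, N, K)
        sampled = sampled.permute(0, 2, 3, 1)                 # (B, N, K, C)
        fused = (weights.unsqueeze(-1) * sampled).sum(dim=2)  # (B, N, C)
        return self.out_proj(fused)
```

For example, `DeformableSampling(dim=64)(torch.randn(2, 100, 64), torch.rand(2, 100, 2), torch.randn(2, 64, 128, 128))` returns a `(2, 100, 64)` tensor. Applying this over several feature scales (or over the features of several frames, each with its own offset parameters) corresponds to the cross-scale and cross-frame variants described in the abstract.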
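The abstract also describes adaptively learned fusion weights for combining multi-frame voxel features. One plausible reading is a per-voxel scoring network whose softmax over the frame axis yields the fusion weights; the sketch below illustrates this under that assumption, with hypothetical names and shapes.

```python
import torch
import torch.nn as nn

class AdaptiveFrameFusion(nn.Module):
    """Fuse per-frame voxel features with adaptively learned weights.

    A small MLP scores each frame's feature at every voxel; a softmax
    over the frame axis turns the scores into fusion weights, so the
    output mixes spatial and temporal information. Illustrative only.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, N, C) voxel features from T consecutive frames
        scores = self.scorer(frame_feats)           # (B, T, N, 1)
        weights = scores.softmax(dim=1)             # normalize over frames
        return (weights * frame_feats).sum(dim=1)   # (B, N, C) fused features
```

For instance, `AdaptiveFrameFusion(64)(torch.randn(2, 3, 500, 64))` fuses three frames of 500 voxel features each into a single `(2, 500, 64)` representation.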
Funder
The Perspective Study Funding of Nanchang Automotive Institute of Intelligence and New Energy