OccTr: A Two-Stage BEV Fusion Network for Temporal Object Detection
Published: 2024-07-03
Issue: 13
Volume: 13
Page: 2611
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Fu Qifang 1, Yu Xinyi 1, Ou Linlin 1
Affiliation:
1. College of Information and Engineering, Zhejiang University of Technology, Hangzhou 310000, China
Abstract
Temporal fusion approaches are critical for 3D visual perception tasks in the IoV (Internet of Vehicles), but they often rely on intermediate representations alone and do not fully exploit the position information carried by the previous frame’s detection results, so they cannot compensate for the lack of depth information in visual data. In this work, we propose a novel framework called OccTr (Occupancy Transformer) that combines two temporal cues, the intermediate representation and the back-end representation, via an occupancy map to enhance temporal fusion in the object detection task. OccTr leverages attention mechanisms to perform both intermediate and back-end temporal fusion by incorporating the intermediate BEV (bird’s-eye view) features and the back-end prediction results of the detector. Our two-stage framework consists of occupancy map generation and cross-attention feature fusion. In stage one, the prediction results are converted into an occupancy grid map to form the back-end representation. In stage two, the high-resolution occupancy maps are fused with the BEV features using cross-attention layers. This fused temporal cue provides a strong prior for the temporal detection process. Experimental results demonstrate the effectiveness of our method in improving detection performance: OccTr achieves an NDS (nuScenes Detection Score) of 37.35% on the nuScenes test set, 1.94 points higher than the baseline.
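The abstract describes a two-stage pipeline: previous-frame detections are rasterised into a BEV occupancy map, which is then fused with current-frame BEV features via cross-attention. The snippet below is a minimal sketch of that idea, not the authors' implementation; the grid size, BEV range, feature dimension, and the function and module names (boxes_to_occupancy, OccCrossAttentionFusion) are illustrative assumptions.

```python
# Minimal sketch of the two-stage idea described in the abstract, NOT the
# authors' released code. Grid size, BEV range, feature dimension, and all
# names below are illustrative assumptions.
import torch
import torch.nn as nn


def boxes_to_occupancy(centers_xy, scores, grid=(64, 64), bev_range=51.2):
    """Stage one: rasterise previous-frame detection centres into a BEV occupancy map.

    centers_xy: (N, 2) box centres in metres from the detector's back-end output.
    scores:     (N,) confidence values written into the occupied cells.
    Returns a (1, 1, H, W) occupancy map.
    """
    h, w = grid
    occ = torch.zeros(1, 1, h, w)
    # Map metric coordinates in [-bev_range, bev_range] to grid indices.
    ix = ((centers_xy[:, 0] + bev_range) / (2 * bev_range) * (w - 1)).long().clamp(0, w - 1)
    iy = ((centers_xy[:, 1] + bev_range) / (2 * bev_range) * (h - 1)).long().clamp(0, h - 1)
    occ[0, 0, iy, ix] = scores
    return occ


class OccCrossAttentionFusion(nn.Module):
    """Stage two: fuse current BEV features (queries) with the encoded occupancy map (keys/values)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.occ_encoder = nn.Conv2d(1, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_feat, occ_map):
        # bev_feat: (B, C, H, W) intermediate BEV features of the current frame
        # occ_map:  (B, 1, H, W) back-end occupancy map from the previous frame
        b, c, h, w = bev_feat.shape
        q = bev_feat.flatten(2).transpose(1, 2)                    # (B, H*W, C)
        kv = self.occ_encoder(occ_map).flatten(2).transpose(1, 2)  # (B, H*W, C)
        fused, _ = self.attn(q, kv, kv, need_weights=False)
        out = self.norm(q + fused)                                 # residual fusion
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    centers = torch.tensor([[10.0, -5.0], [-20.0, 15.0]])   # two boxes from frame t-1
    occ = boxes_to_occupancy(centers, torch.tensor([0.9, 0.7]))
    bev = torch.randn(1, 256, 64, 64)                        # current-frame BEV features
    fused = OccCrossAttentionFusion()(bev, occ)
    print(fused.shape)  # torch.Size([1, 256, 64, 64])
```

The toy grid here is kept small so that dense multi-head attention stays cheap; the paper's fusion over high-resolution occupancy maps would require a more memory-efficient attention variant, which this sketch does not attempt to reproduce.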
Funder
Baima Lake Laboratory Joint Funds of the Zhejiang Provincial Natural Science Foundation of China; National Natural Science Foundation of China