Abstract
Multi-sensor fusion is essential for accurate and reliable autonomous driving. Recently, BEVFusion was proposed to integrate LiDAR and image features in a unified Bird's Eye View (BEV) representation. However, local image information is lost when the backbone network extracts global image information. To fully integrate local features with global features, this paper proposes DS-BEV, a network based on feature selection and refinement. It comprises a Feature Selection Fusion module (FSM) and a Feature Refinement module (FRM). In the FSM, the features of the different modalities are first extracted by modality-specific networks and projected into a unified BEV representation space; through channel and spatial learning, important information is selected from the initial features and fused to generate preliminary fusion features. Then, the image features extracted by a CNN and the preliminary fusion features output by the FSM are fed into the FRM, which refines the fusion features by incorporating the local features produced by the CNN. We evaluate our model on the nuScenes dataset. Experiments show that DS-BEV achieves 69.5% mAP and 72.3% NDS in detection accuracy.
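The channel-and-spatial selection step in the FSM can be illustrated with a minimal sketch. The code below is not the paper's implementation: the gating design (average-pool channel attention followed by a mean-activation spatial gate), the random MLP weights, and the averaging fusion are all simplifying assumptions standing in for the learned modules described in the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_features(feat, w1, w2):
    """Channel-then-spatial gating on a (C, H, W) BEV feature map.
    w1: (C//r, C), w2: (C, C//r) -- weights of a channel-attention MLP
    (randomly initialised here; a real model would learn them)."""
    # Channel attention: global average pool -> 2-layer MLP -> sigmoid gate.
    pooled = feat.mean(axis=(1, 2))                       # (C,)
    gate_c = sigmoid(w2 @ np.maximum(w1 @ pooled, 0.0))   # (C,)
    feat = feat * gate_c[:, None, None]
    # Spatial attention: gate each BEV cell by its mean activation.
    gate_s = sigmoid(feat.mean(axis=0))                   # (H, W)
    return feat * gate_s[None, :, :]

def fsm_fuse(bev_lidar, bev_cam, rng):
    """Select important information per modality, then fuse.
    Averaging stands in for the learned fusion layer of the FSM."""
    C = bev_lidar.shape[0]
    r = 4  # channel-reduction ratio (assumed)
    selected = []
    for feat in (bev_lidar, bev_cam):
        w1 = rng.standard_normal((C // r, C)) * 0.1
        w2 = rng.standard_normal((C, C // r)) * 0.1
        selected.append(select_features(feat, w1, w2))
    return 0.5 * (selected[0] + selected[1])

# Example: fuse two toy 16-channel 8x8 BEV grids.
rng = np.random.default_rng(0)
fused = fsm_fuse(rng.standard_normal((16, 8, 8)),
                 rng.standard_normal((16, 8, 8)), rng)
print(fused.shape)  # (16, 8, 8)
```

The key point the sketch conveys is the ordering the abstract describes: each modality's BEV map is first filtered along channels and spatial locations, and only the selected features enter the fusion step that produces the preliminary fused representation.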