Abstract
In autonomous driving, multi-sensor fusion is a commonly employed method for enhancing detection accuracy and robustness. Millimeter-wave radar and camera complement each other effectively: their combination provides rich semantic information while remaining robust to varying illumination and weather conditions, at relatively low cost. In this paper, we focus on fusing millimeter-wave radar point-cloud features with image features and propose a multi-level, multi-attention feature-level fusion method. We improve the DLA34 backbone network to expand the model's receptive field, fuse radar point-cloud features with image features at multiple levels, and use an improved feature pyramid to process features from both modalities, ensuring effective cross-channel information capture. Leveraging the advantages of multi-level, multi-attention fusion, our model achieves an accuracy of 34.3% on the challenging nuScenes dataset, demonstrating promising performance.