Abstract
Remote sensing image (RSI) target detection methods based on traditional multi-scale feature fusion (MSFF) have achieved great success. However, traditional MSFF significantly increases the computational cost of model training and inference, and its simple fusion operations can cause semantic confusion in the feature maps, which prevents the model from extracting features in a refined way. To reduce the computation associated with MSFF and to give the features in the feature maps an accurate, fine-grained distribution, we propose a single-stage detection model, RS-YOLO. Our main additions in RS-YOLO are a smaller and faster QS-E-ELEN (Quick and Small E-ELEN) module and a feature refinement extraction (FRE) module. In the QS-E-ELEN module, we use QSBlock, skip (jump) connections, and convolution operations to fuse features at different scales, and we reduce the computational cost of the model by exploiting the similarity between RSI feature-map channels. To help the model make better use of the enhanced features, the FRE module makes the feature mapping of the targets to be detected in the RSI accurate and refined. Experiments on the popular NWPU-VHR-10 and SSDD datasets show that RS-YOLO outperforms most mainstream models in the trade-off between accuracy and speed. Specifically, it improves accuracy by 1.6% and 1.7% on the two datasets, respectively, compared with the current state-of-the-art models. At the same time, RS-YOLO reduces both the number of parameters and the computational cost.
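The abstract does not say how QSBlock exploits channel similarity to cut computation. As a hedged illustration only, the sketch below assumes a GhostNet-style scheme (a swapped-in, well-known technique, not necessarily RS-YOLO's QSBlock). In this scheme, only a fraction 1/s of the output channels comes from an ordinary convolution, and the remaining, similar channels are generated from them by cheap depthwise operations. The function names and the parameters `s` and `d` are hypothetical and chosen for illustration. The sketch simply counts multiply-accumulates (MACs) to show why this roughly divides the cost by s.

```python
def conv_macs(c_in, c_out, h, w, k):
    """MACs of a standard k x k convolution producing an h x w x c_out output."""
    return c_in * c_out * h * w * k * k


def ghost_macs(c_in, c_out, h, w, k, s=2, d=3):
    """MACs of a hypothetical GhostNet-style block (assumption, not RS-YOLO's QSBlock).

    c_out // s "intrinsic" channels come from an ordinary k x k convolution;
    the other channels are derived from them with cheap d x d depthwise
    filters, exploiting the redundancy (similarity) between channels.
    """
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by s")
    intrinsic = c_out // s
    primary = conv_macs(c_in, intrinsic, h, w, k)
    # Depthwise: one d x d filter per generated channel, no cross-channel sum.
    cheap = (s - 1) * intrinsic * h * w * d * d
    return primary + cheap


if __name__ == "__main__":
    # Illustrative layer: 256 -> 256 channels on a 40 x 40 feature map, 3 x 3 kernel.
    standard = conv_macs(256, 256, 40, 40, 3)
    ghost = ghost_macs(256, 256, 40, 40, 3, s=2, d=3)
    print(standard, ghost, f"{standard / ghost:.2f}")
```

With s = 2, the cost ratio comes out just under 2, because the cheap depthwise step adds a small overhead on top of the halved primary convolution.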