Macaron Attention: The Local Squeezing Global Attention Mechanism in Tracking Tasks
-
Published:2024-08-08
Issue:16
Volume:16
Page:2896
-
ISSN:2072-4292
-
Container-title:Remote Sensing
-
language:en
-
Short-container-title:Remote Sensing
Author:
Wang Zhixing12345, Luo Hui13ORCID, Liu Dongxu13ORCID, Li Meihui13ORCID, Liu Yunfeng13, Bao Qiliang13, Zhang Jianlin13ORCID
Affiliation:
1. Institute of Optics and Electronics, Chinese Academy of Sciences, Beijing 100045, China 2. School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China 3. Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Beijing 100045, China 4. School of Electronic, Electrical and Communication Engineering, Chinese Academy of Sciences, Beijing 100049, China 5. National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Beijing 100045, China
Abstract
The Unmanned Aerial Vehicle (UAV) tracking tasks find extensive utility across various applications. However, current Transformer-based trackers are generally tailored for diverse scenarios and lack specific designs for UAV applications. Moreover, due to the complexity of training in tracking tasks, existing models strive to improve tracking performance within limited scales, making it challenging to directly apply lightweight designs. To address these challenges, we introduce an efficient attention mechanism known as Macaron Attention, which we integrate into the existing UAV tracking framework to enhance the model’s discriminative ability within these constraints. Specifically, our attention mechanism comprises three components, fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA), collectively forming a Macaron-style attention implementation. Firstly, the FWA module addresses the multi-scale issue in UAVs by cropping tokens within a fixed window scale in the spatial domain. Secondly, in LSGA, to adapt to the scale variation, we employ an adaptive clustering-based token aggregation strategy and design a “window-to-window” fusion attention model to integrate global attention with local attention. Finally, the CGA module is applied to prevent matrix rank collapse and improve tracking performance. By using the FWA, LSGA, and CGA modules, we propose a brand-new tracking model named MATrack. The UAV123 benchmark is the major evaluation dataset of MATrack with 0.710 and 0.911 on success and precision, individually.
Funder
National Natural Science Foundation of China Postdoctoral Fellowship Program of CPSF
Reference45 articles.
1. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv. 2. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. (2016, January 8–16). Fully-convolutional siamese networks for object tracking. Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands. 3. He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27–30). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA. 4. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019, January 5–20). Siamrpn++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA. 5. Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18–22). High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.
|
|