Central Attention with Sliding Window for Efficient Visual Tracking


Chen Zhen¹, Xiao Xianbing¹, Xiong Xingzhong¹, Meng Fanqin¹, Liu Jun¹


1. Sichuan University of Science and Engineering


Abstract Cross-correlation is widely used for feature fusion, especially in Siamese-based trackers. However, it struggles to capture complex nonlinear relationships and is susceptible to outliers in the samples. Recently, researchers have used Transformers for feature fusion and achieved stronger performance. However, most of these methods rely on modeling global token relationships, which can destroy the local and spatial correlations inherent in 2D structures. This paper proposes SiamCAT, an efficient tracking algorithm based on central attention and sliding-window sampling. First, context augmentation with sliding windows is introduced to preserve the spatial structure of the 2D input: attention is used to emulate the way convolution processes 2D data, and an internal memory composed of learnable parameters dynamically adjusts the attention layer. Second, to learn efficient feature fusion, a feature fusion network is constructed to combine template features and search features. Experiments show that SiamCAT achieves state-of-the-art results on the LaSOT, OTB100, NFS, UAV123, GOT10K, and TrackingNet benchmarks and runs in real time at 47 frames per second on a CPU. The code will be released at https://github.com/cnchange/SiamCAT.
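
The idea of restricting attention to a sliding window, so that each position attends only to its 2D neighbourhood much as a convolution kernel would, can be illustrated with a minimal sketch. This is not the SiamCAT implementation: the function name `window_attention`, the window size `k`, and the use of identity query/key/value projections are assumptions made for illustration, and the learnable internal memory described in the abstract is omitted.

```python
import math

def window_attention(feat, k=3):
    """Local attention over a 2D feature map (illustrative sketch only).

    feat: H x W grid of C-dimensional vectors, given as nested lists.
    For every position, the centre token acts as the query and the tokens
    inside the k x k window around it act as keys and values.
    Identity projections are assumed; a real layer would learn them.
    """
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0])
    r = k // 2
    scale = 1.0 / math.sqrt(c)
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            q = feat[i][j]
            # Tokens inside the window, clipped at the map border.
            keys = [feat[y][x]
                    for y in range(max(0, i - r), min(h, i + r + 1))
                    for x in range(max(0, j - r), min(w, j + r + 1))]
            # Scaled dot-product scores between the centre token and its window.
            scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in keys]
            # Numerically stable softmax over the window.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Weighted sum of the window tokens.
            row.append([sum(wt * kv[d] for wt, kv in zip(weights, keys))
                        for d in range(c)])
        out.append(row)
    return out

# Toy 4 x 4 map with 2 channels; the second channel is constant.
feat = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
out = window_attention(feat, k=3)
```

Because every output vector is a convex combination of tokens from the same window, the 2D layout of the input is preserved and each output depends only on its local neighbourhood, which is the property the abstract contrasts with global token modeling.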


Research Square Platform LLC








