CVTrack: Combined Convolutional Neural Network and Vision Transformer Fusion Model for Visual Tracking

Published:2024-01-03 Issue:1 Volume:24 Page:274
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Wang Jian¹², Song Yueming¹², Song Ce¹, Tian Haonan¹, Zhang Shuai¹, Sun Jinghui¹

Affiliation:

1. Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China

2. University of Chinese Academy of Sciences, Beijing 101408, China

Abstract

Most single-object trackers currently employ either a convolutional neural network (CNN) or a vision transformer as the backbone for object tracking. In CNNs, convolutional operations excel at extracting local features but struggle to capture global representations. Vision transformers, on the other hand, use cascaded self-attention modules to capture long-range feature dependencies but may overlook local feature details. To address these limitations, we propose a target-tracking algorithm called CVTrack, which uses a parallel dual-branch backbone network combining a CNN and a transformer for feature extraction and fusion. First, the CNN and transformer branches of this backbone extract local and global features from the input image. Through bidirectional information interaction channels, the local features from the CNN branch and the global features from the transformer branch interact and fuse. Second, deep cross-correlation operations and transformer-based methods fuse the template and search region features, enabling comprehensive interaction between them. The fused features are then fed into the prediction module to accomplish the object-tracking task. Our tracker achieves state-of-the-art performance on five benchmark datasets while maintaining real-time execution speed. Finally, we conduct ablation studies to demonstrate the efficacy of each module in the parallel dual-branch feature extraction backbone network.
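As a rough illustration of the template and search-region fusion step, the sketch below assumes that "deep cross-correlation" refers to the depth-wise (channel-by-channel) cross-correlation commonly used in Siamese trackers; the abstract does not spell out the exact operation. The function name `depthwise_xcorr` and the toy inputs are hypothetical and are not taken from the paper. It is written in plain Python for readability, not speed.

```python
def depthwise_xcorr(search, template):
    """Slide each template channel over the matching search channel.

    search:   list of C channels, each an Hs x Ws nested list of floats
    template: list of C channels, each an Ht x Wt nested list of floats
    returns:  list of C response maps, each (Hs-Ht+1) x (Ws-Wt+1)

    Assumption: depth-wise cross-correlation, "valid" positions only,
    with no padding or stride.
    """
    if len(search) != len(template):
        raise ValueError("search and template need the same channel count")
    responses = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        # One response value per placement of the template window.
        resp = [
            [
                sum(
                    s_ch[i + di][j + dj] * t_ch[di][dj]
                    for di in range(ht)
                    for dj in range(wt)
                )
                for j in range(ws - wt + 1)
            ]
            for i in range(hs - ht + 1)
        ]
        responses.append(resp)
    return responses


# Toy input: one channel, 3x3 search region, 2x2 template of ones.
search = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
template = [[[1.0, 1.0], [1.0, 1.0]]]
response = depthwise_xcorr(search, template)
print(response)  # [[[12.0, 16.0], [24.0, 28.0]]]
```

Each channel keeps its own response map, so later layers can weigh channels separately. In a real tracker the inputs would be the backbone feature maps of the template and the search region, and the response maps would go on to further fusion and the prediction module.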

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/24/1/274/pdf

References: 65 articles.

1. Bolme, D.S., Beveridge, J.R., Draper, B.A., and Lui, Y.M. (2010, January 13–18). Visual Object Tracking Using Adaptive Correlation Filters. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA.

2. Henriques et al. (2012). Exploiting the Circulant Structure of Tracking-by-Detection with Kernels. Lect. Notes Comput. Sci.

3. Henriques et al. (2014). High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell.

4. Kiani Galoogahi, H., Fagg, A., and Lucey, S. (2017, January 19–22). Learning Background-aware Correlation Filters for Visual Tracking. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.

5. Mueller, M., Smith, N., and Ghanem, B. (2017, January 21–26). Context-Aware Correlation Filter Tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.