Query-Based Object Visual Tracking with Parallel Sequence Generation

Published: 2024-07-24 Issue: 15 Volume: 24 Page: 4802
ISSN: 1424-8220
Container-title: Sensors
Language: en

Author:

Liu Chang¹, Zhang Bin¹, Bo Chunjuan², Wang Dong¹

Affiliation:

1. School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China

2. School of Information and Communication Engineering, Dalian Minzu University, Dalian 116600, China

Abstract

Query decoders have been shown to achieve good performance in object detection, but their performance in object tracking remains insufficient. Sequence-to-sequence learning has recently been explored in this context, with the idea of describing a target as a sequence of discrete tokens. In this study, we show experimentally that, with an appropriate representation, a query decoder that predicts the target coordinate sequence in parallel can achieve both good performance and high speed. We propose a concise query-based tracking framework, named QPSTrack, that predicts the target coordinate sequence in a parallel manner. A set of queries is designed so that each query is responsible for a different coordinate of the tracked target. All the queries jointly represent one target, rather than following the traditional one-to-one matching pattern between query and target. Moreover, we adopt an adaptive decoding scheme consisting of a one-layer adaptive decoder and learnable adaptive inputs for the decoder. This decoding scheme helps the queries decode the template-guided search features more effectively. Furthermore, we explore the plain ViT-Base, ViT-Large, and lightweight hierarchical LeViT architectures as the encoder backbone, providing a family of three variants in total. All the trackers obtain a good trade-off between speed and performance; for instance, our tracker QPSTrack-B256 with the ViT-Base encoder achieves a 69.1% AUC on the LaSOT benchmark at 104.8 FPS.
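The idea of describing a target as discrete coordinate tokens and reading them out in parallel can be illustrated with a minimal sketch. This is not the authors' implementation: the bin count, the function names `quantize_box` and `decode_parallel`, the `[x1, y1, x2, y2]` box layout, and the argmax read-out are all assumptions made for illustration, and the per-query scores are hand-built rather than produced by a decoder.

```python
# Illustrative sketch only; names, bin count, and read-out rule are assumptions.

def quantize_box(box, n_bins=100):
    """Map a normalized box [x1, y1, x2, y2] in [0, 1] to integer tokens."""
    return [min(n_bins - 1, max(0, int(c * n_bins))) for c in box]


def decode_parallel(per_query_scores):
    """Read one coordinate from each query independently.

    per_query_scores holds one score list per query (one query per
    coordinate). No query waits for another query's output, which is what
    distinguishes this from token-by-token autoregressive decoding.
    """
    box = []
    for scores in per_query_scores:
        n_bins = len(scores)
        token = max(range(n_bins), key=scores.__getitem__)  # argmax bin
        box.append((token + 0.5) / n_bins)  # centre of the chosen bin
    return box


def one_hot_scores(token, n_bins=100):
    """Hand-built stand-in for a query's output scores over the bins."""
    return [10.0 if i == token else 0.0 for i in range(n_bins)]


tokens = quantize_box([0.25, 0.5, 0.75, 1.0])
decoded = decode_parallel([one_hot_scores(t) for t in tokens])
print(tokens)
print(decoded)
```

In this toy setting the four coordinates are recovered to within one bin width; a coarser or finer `n_bins` trades vocabulary size against quantization error.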

Funder

National Natural Science Foundation of China

Talent Fund of Liaoning Province

Excellent Science and Technique Talent Foundation of Dalian

Publisher

MDPI AG

Link

https://www.mdpi.com/1424-8220/24/15/4802/pdf

Reference: 48 articles.

1. Wang, N., Zhou, W., Wang, J., and Li, H. (2021, June 20–25). Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. Proceedings of the CVPR, Nashville, TN, USA.

2. Yan, B., Peng, H., Fu, J., Wang, D., and Lu, H. (2021, October 11–17). Learning spatio-temporal transformer for visual tracking. Proceedings of the ICCV, Montreal, QC, Canada.

3. Cui, Y., Jiang, C., Wang, L., and Wu, G. (2021). Target transformed regression for accurate tracking. arXiv.

4. Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., and Lu, H. (2021, June 20–25). Transformer tracking. Proceedings of the CVPR, Nashville, TN, USA.

5. Ye, B., Chang, H., Ma, B., Shan, S., and Chen, X. (2022, October 23–27). Joint feature learning and relation modeling for tracking: A one-stream framework. Proceedings of the ECCV, Tel Aviv, Israel.