Query-Based Object Visual Tracking with Parallel Sequence Generation
Authors:
Liu Chang¹, Zhang Bin¹, Bo Chunjuan², Wang Dong¹
Affiliations:
1. School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China
2. School of Information and Communication Engineering, Dalian Minzu University, Dalian 116600, China
Abstract
Query decoders have been shown to achieve good performance in object detection, but their performance on object tracking remains insufficient. Sequence-to-sequence learning has recently been explored for this task, with the idea of describing a target as a sequence of discrete tokens. In this study, we show experimentally that, with an appropriate representation, a query decoder that predicts the target coordinate sequence in parallel can achieve both good performance and high speed. We propose a concise query-based tracking framework, named QPSTrack, which predicts the target coordinate sequence in a parallel manner. A set of queries is designed such that each query is responsible for a different coordinate of the tracked target; all the queries jointly represent one target, rather than following the traditional one-to-one matching pattern between queries and targets. Moreover, we adopt an adaptive decoding scheme consisting of a one-layer adaptive decoder and learnable adaptive inputs for the decoder. This decoding scheme helps the queries decode the template-guided search features more effectively. Furthermore, we explore the plain ViT-Base and ViT-Large architectures as well as the lightweight hierarchical LeViT architecture as the encoder backbone, yielding a family of three variants. All three trackers achieve a good trade-off between speed and performance; for instance, our tracker QPSTrack-B256 with the ViT-Base encoder achieves a 69.1% AUC on the LaSOT benchmark at 104.8 FPS.
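The parallel coordinate-query design described above can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only, not the authors' released implementation: the module name ParallelCoordinateDecoder, the four-query layout (one query each for x, y, w, h), the single standard nn.TransformerDecoderLayer standing in for the adaptive decoder, and the discrete-bin classification head are all hypothetical choices consistent with the abstract's description.

```python
# Minimal sketch of parallel coordinate-query decoding (assumptions only;
# not the QPSTrack implementation).
import torch
import torch.nn as nn

class ParallelCoordinateDecoder(nn.Module):
    """Four learnable queries (x, y, w, h) jointly decode one target box
    from template-guided search features in a single parallel pass."""

    def __init__(self, dim=256, num_bins=1000, num_heads=8):
        super().__init__()
        # One query per box coordinate; together they represent one target,
        # instead of the usual one-query-per-object matching pattern.
        self.coord_queries = nn.Parameter(torch.randn(4, dim))
        # One-layer decoder (assumed): cross-attends into search features.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Each decoded query is classified into a discrete coordinate bin,
        # mirroring the token-sequence view of the target.
        self.head = nn.Linear(dim, num_bins)

    def forward(self, search_feats):
        # search_feats: (B, N, dim) template-guided search-region tokens.
        b = search_feats.size(0)
        queries = self.coord_queries.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, search_feats)  # (B, 4, dim)
        logits = self.head(decoded)                    # (B, 4, num_bins)
        # All four coordinates come out in one forward pass (parallel),
        # rather than autoregressively token by token.
        return logits.argmax(-1)                       # (B, 4) bin indices

# Usage sketch:
# feats = torch.randn(2, 16 * 16, 256)
# boxes = ParallelCoordinateDecoder()(feats)  # (2, 4) coordinate bins
```

The key contrast with autoregressive sequence-to-sequence trackers is in the forward pass: all coordinate tokens are decoded simultaneously from independent queries, which is what allows the high frame rates the abstract reports.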
Funders:
National Natural Science Foundation of China
Talent Fund of Liaoning Province
Excellent Science and Technique Talent Foundation of Dalian