Affiliation:
1. College of Computer Science and Technology, Changchun University of Science and Technology, Changchun, China
Abstract
Object tracking is an essential component of computer vision and plays a significant role in various practical applications. Recently, transformer‐based trackers have become the predominant method for tracking due to their robustness and efficiency. However, existing transformer‐based trackers typically focus solely on the template features, neglecting the interactions between the search features and the template features during the tracking process. To address this issue, this article introduces a multi‐head cross‐attention transformer for visual tracking (MCTT), which effectively enhances the interaction between the template branch and the search branch, enabling the tracker to prioritize discriminative features. Additionally, an auxiliary segmentation mask head is designed to produce a pixel‐level feature representation, improving tracking accuracy by predicting a set of binary masks. Comprehensive experiments on benchmark datasets, including LaSOT, GOT‐10k, UAV123 and TrackingNet, compare MCTT against various advanced methods and demonstrate that our approach achieves promising tracking performance. MCTT achieves an average overlap (AO) score of 72.8 on GOT‐10k.
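To make the template–search interaction concrete, the following is a minimal sketch of generic multi‐head scaled dot‐product cross‐attention, not the MCTT architecture itself. It assumes, purely for illustration, that queries come from the search features and that keys and values come from the template features. The function name, the weight matrices and the token counts are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(search, template, w_q, w_k, w_v, w_o, num_heads):
    """Generic cross-attention sketch (illustrative, not the paper's exact design).

    search:   (n_s, d) search-region tokens, used as queries (assumption)
    template: (n_t, d) template tokens, used as keys and values (assumption)
    w_q, w_k, w_v, w_o: (d, d) projection matrices (hypothetical parameters)
    """
    n_s, d = search.shape
    n_t = template.shape[0]
    assert d % num_heads == 0, "embedding size must be divisible by num_heads"
    d_h = d // num_heads

    # Project, then split the channel dimension into heads: (heads, tokens, d_h).
    q = (search @ w_q).reshape(n_s, num_heads, d_h).transpose(1, 0, 2)
    k = (template @ w_k).reshape(n_t, num_heads, d_h).transpose(1, 0, 2)
    v = (template @ w_v).reshape(n_t, num_heads, d_h).transpose(1, 0, 2)

    # Each search token attends over all template tokens, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)   # (heads, n_s, n_t)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                     # (heads, n_s, d_h)

    # Merge the heads back and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(n_s, d)
    return out @ w_o, attn

# Toy usage with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
d, heads = 64, 8
search = rng.standard_normal((256, d))    # e.g. 16 x 16 search tokens
template = rng.standard_normal((64, d))   # e.g. 8 x 8 template tokens
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
fused, attn = multi_head_cross_attention(search, template, w_q, w_k, w_v, w_o, heads)
```

In this sketch, each row of `attn` is a distribution over template tokens, so the fused search features are weighted mixtures of template information. Swapping the roles of `search` and `template` gives the opposite direction of interaction.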