Multimodal Features Alignment for Vision–Language Object Tracking
Published: 2024-03-27
Issue: 7
Volume: 16
Page: 1168
ISSN: 2072-4292
Container-title: Remote Sensing
Language: en
Author:
Ye Ping 1, Xiao Gang 1, Liu Jun 2
Affiliation:
1. School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
2. Artificial Intelligence Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
Abstract
Vision–language tracking presents a crucial challenge in multimodal object tracking. Integrating language features with visual features can enhance target localization and improve the stability and accuracy of the tracking process. However, most existing fusion models in vision–language trackers simply concatenate visual and linguistic features without considering their semantic relationships. Such methods fail to distinguish the target’s appearance features from the background, particularly when the target changes dramatically. To address these limitations, we introduce an innovative technique known as multimodal features alignment (MFA) for vision–language tracking. In contrast to basic concatenation methods, our approach employs a factorized bilinear pooling method that conducts squeezing and expanding operations to create a unified feature representation from visual and linguistic features. Moreover, we apply the co-attention mechanism twice to derive varied weights for the search region, ensuring that higher weights are placed on the aligned visual and linguistic features. Subsequently, the fused feature map with diversely distributed weights serves as the search region during the tracking phase, facilitating anchor-free grounding to predict the target’s location. Extensive experiments are conducted on multiple public datasets, and our proposed tracker obtains success scores of 0.654/0.553/0.447 and precision scores of 0.872/0.556/0.513 on OTB-LANG/LaSOT/TNL2K. These results are competitive with those of recent state-of-the-art vision–language trackers.
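The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of multimodal factorized bilinear pooling: both modalities are first expanded by learned projections, their elementwise interaction is then squeezed by sum-pooling over factors, and the result is normalized. The function name, dimensions, and normalization details below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def factorized_bilinear_pool(v, l, W_v, W_l, k):
    """Hypothetical sketch of factorized bilinear pooling for fusing a
    visual feature v (shape (d_v,)) and a language feature l (shape (d_l,)).
    W_v: (d_v, k*o) and W_l: (d_l, k*o) expand each modality; the
    elementwise product is squeezed by sum-pooling over k-sized factors."""
    z = (v @ W_v) * (l @ W_l)                # expand + elementwise interaction, shape (k*o,)
    o = z.size // k
    z = z.reshape(o, k).sum(axis=1)          # squeeze: sum-pool each k-sized factor -> (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))      # power normalization
    return z / (np.linalg.norm(z) + 1e-12)   # l2 normalization

# Usage with illustrative dimensions:
rng = np.random.default_rng(0)
v = rng.standard_normal(256)                 # visual feature
l = rng.standard_normal(128)                 # language feature
W_v = rng.standard_normal((256, 5 * 8))
W_l = rng.standard_normal((128, 5 * 8))
fused = factorized_bilinear_pool(v, l, W_v, W_l, k=5)
```

In a full tracker, a fused map like this would be reweighted by co-attention before localization; this sketch covers only the bilinear fusion itself.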