Abstract
Association, which aims to link bounding boxes of the same identity across a video sequence, is a central component in multi-object tracking (MOT). To train association modules, e.g., parametric networks, real video data are usually used. However, annotating person tracks in consecutive video frames is expensive, and due to their inflexibility, such real data offer limited opportunities to evaluate system performance w.r.t. changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, in which the motion characteristics of cameras and objects are manually configured to resemble those of real-world datasets. We show that, compared with real data, association knowledge obtained from synthetic data achieves very similar performance on real-world test sets without domain adaptation techniques. We attribute this intriguing observation to two factors. First and foremost, 3D engines can faithfully simulate motion factors such as camera movement, camera view, and object movement, so the simulated videos provide association modules with effective motion features. Second, the experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics,Artificial Intelligence,Computer Networks and Communications,Computer Science Applications,Computer Vision and Pattern Recognition,Modeling and Simulation,Signal Processing,Control and Systems Engineering
Cited by
1 article.
1. Spatial-Temporal Graph U-Net for Skeleton-Based Human Motion Infilling. 2024 IEEE International Conference on Industrial Technology (ICIT), 2024-03-25.