Dynamic Context Removal: A General Training Strategy for Robust Models on Video Action Predictive Tasks

Published: 2023-08-13 Issue: 12 Volume: 131 Page: 3272-3288
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Language: en
Short-container-title: Int J Comput Vis

Author:

Xu Xinyu, Li Yong-Lu, Lu Cewu

Abstract

Predicting future actions is an essential capability of intelligent systems and embodied AI. However, compared with traditional recognition tasks, the uncertainty of the future and the need for reasoning ability make prediction tasks very challenging and far from solved. Previous methods in this field usually focus on model architecture design, and little attention has been paid to how to train models with a proper learning policy. To this end, we propose a simple but effective training strategy, Dynamic Context Removal (DCR), which dynamically schedules the visibility of context in different training stages. It follows a human-like curriculum learning process, i.e., it gradually removes the event context to increase the prediction difficulty until the final prediction target is reached. In addition, we explore how to train robust models that give consistent predictions at different levels of observable context. Our learning scheme is plug-and-play and easy to integrate with widely used reasoning models, including Transformer and LSTM, with advantages in both effectiveness and efficiency. We study two action prediction problems, i.e., Video Action Anticipation and Early Action Recognition. In extensive experiments, our method achieves state-of-the-art results on several widely used benchmarks.
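
The idea of scheduling context visibility over training can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the schedule, so the linear decay, the function names `visible_context_ratio` and `mask_context`, and the treatment of "context" as the frames that follow the observed segment are all assumptions made for illustration.

```python
def visible_context_ratio(epoch: int, total_epochs: int,
                          final_ratio: float = 0.0) -> float:
    """Share of the removable context still visible at `epoch`.

    Hypothetical linear schedule: 1.0 (all context visible) at epoch 0,
    decaying to `final_ratio` at the last epoch. The schedule used in
    the paper is not stated in the abstract.
    """
    if total_epochs <= 1:
        return final_ratio
    # Clamp training progress to [0, 1] so out-of-range epochs are safe.
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return 1.0 + (final_ratio - 1.0) * progress


def mask_context(frames: list, observed_len: int, ratio: float) -> list:
    """Keep the first `observed_len` frames plus a `ratio` share of the rest.

    Assumes `observed_len <= len(frames)`; the frames after the observed
    segment play the role of the extra context that is gradually removed.
    """
    extra = round((len(frames) - observed_len) * ratio)
    return frames[: observed_len + extra]


if __name__ == "__main__":
    clip = list(range(10))  # stand-in for 10 frame features
    for epoch in range(5):
        ratio = visible_context_ratio(epoch, total_epochs=5)
        print(epoch, ratio, len(mask_context(clip, observed_len=4, ratio=ratio)))
```

Under these assumptions, early epochs see the full clip and the last epoch sees only the observed segment, which matches the easy-to-hard curriculum the abstract describes.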

Funder

National Natural Science Foundation of China

Shanghai Municipal Science and Technology Major Project

SHEITC

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

https://link.springer.com/content/pdf/10.1007/s11263-023-01850-6.pdf

References (64 articles; first 5 shown)

1. Alvarez, W. M., Moreno, F. M., Sipele, O., Smirnov, N., & Olaverri-Monreal, C. (2020). Autonomous driving: Framework for pedestrian intention estimation in a real world scenario. In 2020 IEEE intelligent vehicles symposium (IV) (pp. 39–44). IEEE.

2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. Preprint retrieved from arXiv:2103.15691

3. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning (pp. 41–48).

4. Camporese, G., Coscia, P., Furnari, A., Farinella, G. M., & Ballan, L. (2021). Knowledge distillation for action anticipation via label smoothing. In 2020 25th international conference on pattern recognition (ICPR) (pp. 3312–3319). IEEE.

5. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

Cited by 1 article.

1. From Easy to Hard: Learning Curricular Shape-Aware Features for Robust Panoptic Scene Graph Generation; International Journal of Computer Vision; 2024-08-05