Affiliations:
1. School of Engineering and Science, Tecnologico de Monterrey, Nuevo León 64700, Mexico
2. KODS.ai, Mexico City 11510, Mexico
Abstract
Training a model to recognize human actions in videos is computationally intensive. Modern strategies use transfer learning to make the process more efficient, but they remain limited in flexibility and efficiency: existing solutions offer restricted functionality and rely heavily on pretrained architectures, which can limit their applicability to diverse scenarios. Our work explores knowledge distillation (KD) for enhancing the training of self-supervised video models in three aspects: improving classification accuracy, accelerating model convergence, and increasing model flexibility under regular and limited-data scenarios. We tested our method on the UCF101 dataset using balanced subsets of different proportions: 100%, 50%, 25%, and 2%. We found that using knowledge distillation to guide the model’s training outperforms traditional training: it accelerates convergence without degrading classification accuracy, in both standard and data-scarce settings. Additionally, knowledge distillation enables cross-architecture flexibility, allowing model customization for applications ranging from resource-limited to high-performance scenarios.