YOLO Series for Human Hand Action Detection and Classification from Egocentric Videos
Authors:
Nguyen Hung-Cuong 1, Nguyen Thi-Hao 1, Scherer Rafał 2, Le Van-Hung 3
Affiliation:
1. Faculty of Engineering Technology, Hung Vuong University, Viet Tri City 35100, Vietnam
2. Department of Intelligent Computer Systems, Czestochowa University of Technology, 42-218 Czestochowa, Poland
3. Faculty of Basic Science, Tan Trao University, Tuyen Quang City 22000, Vietnam
Abstract
Hand detection and classification is an important pre-processing step in building applications based on three-dimensional (3D) hand pose estimation and hand activity recognition. To automatically delimit the hand region in egocentric vision (EV) datasets, and in particular to trace the development and performance of the “You Only Look Once” (YOLO) network over the past seven years, we propose a study comparing the efficiency of hand detection and classification based on the YOLO-family networks. The study addresses the following problems: (1) systematizing the architectures, advantages, and disadvantages of the YOLO-family networks from version (v)1 to v7; (2) preparing ground-truth data for pre-trained models and evaluation of hand detection and classification on EV datasets (FPHAB, HOI4D, RehabHand); (3) fine-tuning hand detection and classification models based on the YOLO-family networks and evaluating them on the EV datasets. Hand detection and classification results of the YOLOv7 network and its variants were the best across all three datasets. The results of the YOLOv7-w6 network are as follows: precision P = 97% on FPHAB and P = 95% on HOI4D at an IoU threshold (ThreshIOU) of 0.5, and P greater than 95% on RehabHand at the same threshold; the processing speed of YOLOv7-w6 is 60 fps at a resolution of 1280 × 1280 pixels, and that of YOLOv7 is 133 fps at a resolution of 640 × 640 pixels.
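The precision figures above are reported at an IoU threshold of 0.5. As a rough illustration of what such a score means for a single frame, the following minimal Python sketch computes IoU between axis-aligned boxes and precision under a greedy matching rule; the (x1, y1, x2, y2) box format, the matching strategy, and the function names are illustrative assumptions, not the authors' evaluation code:

    def iou(box_a, box_b):
        """Intersection over Union of two boxes in (x1, y1, x2, y2) pixel format."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def precision_at_iou(detections, ground_truths, thresh_iou=0.5):
        """Precision = TP / (TP + FP): a detection counts as a true positive
        when it overlaps a not-yet-matched ground-truth box with IoU >= thresh_iou."""
        matched = [False] * len(ground_truths)
        true_positives = 0
        for det in detections:
            best_iou, best_idx = 0.0, -1
            for i, gt in enumerate(ground_truths):
                if matched[i]:
                    continue
                overlap = iou(det, gt)
                if overlap > best_iou:
                    best_iou, best_idx = overlap, i
            if best_iou >= thresh_iou:
                true_positives += 1
                matched[best_idx] = True
        return true_positives / len(detections) if detections else 0.0

    # Toy frame: one well-localized hand detection and one false positive.
    dets = [(10, 10, 50, 50), (200, 200, 240, 240)]
    gts = [(12, 8, 52, 48)]
    print(precision_at_iou(dets, gts))  # 0.5

In practice, detection benchmarks aggregate such per-frame counts over a whole dataset and per class; the sketch only shows the matching rule that a threshold such as ThreshIOU = 0.5 implies.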
Funder
Hung Vuong University; Polish Minister of Science and Higher Education; Tan Trao University
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
References (50 articles)
1. Tompson. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph., 2014.
2. Tang. Latent regression forest: Structured estimation of 3D hand poses. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
3. Sun, X., Wei, Y., Liang, S., Tang, X., and Sun, J. Cascaded hand pose regression. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
4. Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.K. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
5. Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., and Lee, J. MediaPipe: A Framework for Building Perception Pipelines. arXiv, 2019.
Cited by
5 articles.