Author:
Zhang Fenghao, Zhao Lin, Li Shengling, Su Wanjuan, Liu Liman, Tao Wenbing
Abstract
Estimating 3D hand shape from a single-view RGB image is important for many applications. However, the diversity of hand shapes and postures, depth ambiguity, and occlusion may result in pose errors and noisy hand meshes. Making full use of 2D cues such as 2D pose can effectively improve the quality of 3D human hand shape estimation. In this paper, we use 2D joint heatmaps to obtain spatial details for robust pose estimation. We also introduce a depth-independent 2D mesh to avoid depth ambiguity in mesh regression, enabling efficient hand-image alignment. Our method has four cascaded stages: 2D cue extraction, pose feature encoding, initial reconstruction, and reconstruction refinement. Specifically, we first encode the image to extract semantic features during 2D cue extraction; these features are also used to predict hand joints and segmentation. Then, during the pose feature encoding stage, a hand joints encoder learns spatial information from the joint heatmaps. Next, a coarse 3D hand mesh and a 2D mesh are obtained in the initial reconstruction step; a mesh squeeze-and-excitation block fuses different hand features to enhance perception of 3D hand structures. Finally, a global mesh refinement stage learns non-local relations between vertices of the hand mesh from the predicted 2D mesh, predicting an offset mesh that fine-tunes the reconstruction results. Quantitative and qualitative results on the FreiHAND benchmark dataset demonstrate that our approach achieves state-of-the-art performance.
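The mesh squeeze-and-excitation block mentioned in the abstract can be pictured as standard SE-style channel gating applied to per-vertex mesh features. The sketch below is an assumption based on the usual squeeze-and-excitation formulation (global pooling, bottleneck MLP, sigmoid gating), not the authors' exact design; the feature shape, reduction ratio `r`, and weight initialization are all hypothetical.

```python
import numpy as np

def se_block(features, w1, b1, w2, b2):
    """SE-style channel gating over per-vertex mesh features.

    features: (V, C) array of per-vertex features (hypothetical layout).
    w1, b1:   bottleneck layer, maps C -> C // r.
    w2, b2:   expansion layer, maps C // r -> C.
    """
    # Squeeze: global average over all vertices -> one descriptor per channel.
    z = features.mean(axis=0)                     # (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid.
    h = np.maximum(0.0, w1 @ z + b1)              # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))      # (C,) gates in (0, 1)
    # Recalibrate: scale every vertex's features channel-wise.
    return features * s

rng = np.random.default_rng(0)
V, C, r = 778, 64, 4        # 778 vertices as in the MANO hand mesh
x = rng.standard_normal((V, C))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
```

Because the gates lie in (0, 1), the block can only attenuate channels, letting the network emphasize feature channels that are informative for 3D hand structure; in the paper the block is described as fusing different hand features, so the real design may take multiple feature inputs rather than one.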
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Vision and Pattern Recognition