STFormer: Spatio‐temporal former for hand–object interaction recognition from egocentric RGB video

Published:2024-09 Issue:17 Volume:60 Page:
ISSN:0013-5194
Container-title:Electronics Letters
language:en
Short-container-title:Electronics Letters

Author:

Liang Jiao¹², Wang Xihan¹², Yang Jiayi¹², Gao Quanli¹²

Affiliation:

1. State‐Province Joint Engineering and Research Center of Advanced Networking and Intelligent Information Services, Xi'an Polytechnic University, Xi'an, China

2. School of Computer Science, Xi'an Polytechnic University, Xi'an, China

Abstract

In recent years, video‐based hand–object interaction has received widespread attention from researchers. However, due to the complexity and occlusion of hand movements, hand–object interaction recognition based on RGB videos remains a highly challenging task. Here, an end‐to‐end spatio‐temporal former (STFormer) network for understanding hand behaviour in interactions is proposed. The network consists of three modules: FlexiViT feature extraction, hand–object pose estimator, and interaction action classifier. The FlexiViT is used to extract multi‐scale features from each image frame. The hand–object pose estimator is designed to predict 3D hand pose keypoints and object labels for each frame. The interaction action classifier is used to predict the interaction action categories for the entire video. The experimental results demonstrate that our approach achieves competitive recognition accuracies of 94.96% and 88.84% on two datasets, namely first‐person hand action (FPHA) and 2 Hands and Objects (H2O).
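
The data flow described in the abstract (two per-frame stages followed by one video-level prediction) can be sketched as below. This is only an illustrative skeleton, not the authors' implementation: the names `FramePrediction`, `recognise_interaction` and the `toy_*` stand-ins are hypothetical, the stand-ins contain no learned model, and passing both the features and the per-frame predictions to the classifier is an assumption, since the abstract does not say what the classifier consumes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # one 3D hand keypoint (x, y, z)


@dataclass
class FramePrediction:
    """Per-frame output of the pose-estimation stage (hypothetical container)."""
    hand_keypoints: List[Keypoint]
    object_label: int


def recognise_interaction(
    frames: Sequence[Sequence[float]],
    extract_features: Callable[[Sequence[float]], List[float]],
    estimate_pose: Callable[[List[float]], FramePrediction],
    classify_action: Callable[[List[List[float]], List[FramePrediction]], int],
) -> Tuple[List[FramePrediction], int]:
    """Run the two per-frame stages, then one prediction for the whole video."""
    if not frames:
        raise ValueError("at least one frame is required")
    # Stage 1: features are extracted independently from each frame.
    features = [extract_features(frame) for frame in frames]
    # Stage 2: hand keypoints and an object label are predicted per frame.
    per_frame = [estimate_pose(feat) for feat in features]
    # Stage 3: a single action category is predicted for the entire video.
    # Assumption: the classifier sees both features and per-frame predictions.
    action = classify_action(features, per_frame)
    return per_frame, action


# Toy stand-ins so the skeleton runs; they are NOT the models from the paper.
def toy_extract(frame: Sequence[float]) -> List[float]:
    return [sum(frame) / len(frame)]


def toy_pose(feat: List[float]) -> FramePrediction:
    return FramePrediction([(feat[0], 0.0, 0.0)], object_label=int(feat[0] > 0.5))


def toy_classify(features: List[List[float]], preds: List[FramePrediction]) -> int:
    # Pick the most frequent per-frame object label as a placeholder "action".
    return Counter(p.object_label for p in preds).most_common(1)[0][0]


video = [[0.0, 0.2], [0.8, 1.0], [0.6, 0.8]]  # three fake "frames"
per_frame, action = recognise_interaction(video, toy_extract, toy_pose, toy_classify)
```

Keeping the three stages as injected callables mirrors the modular split named in the abstract, so each stand-in could be swapped for a real model without changing the surrounding flow.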

Funder

National Natural Science Foundation of China

Publisher

Institution of Engineering and Technology (IET)
