Enhancing Recognition of Human–Object Interaction from Visual Data Using Egocentric Wearable Camera

Published: 2024-07-29 Issue: 8 Volume: 16 Page: 269
ISSN: 1999-5903
Container-title: Future Internet
Language: en
Short-container-title: Future Internet

Author:

Hamid Danish¹, Haq Muhammad Ehatisham Ul¹, Yasin Amanullah¹, Murtaza Fiza¹, Azam Muhammad Awais²

Affiliation:

1. Department of Creative Technologies, Faculty of Computing and Artificial Intelligence (FCAI), Air University, Islamabad 44000, Pakistan

2. Technology and Innovation Research Group, School of Information Technology, Whitecliffe, Wellington 6145, New Zealand

Abstract

Object detection and human action recognition are important in many real-world applications. Understanding how a person interacts with different objects, i.e., human–object interaction, is equally important because it enables applications in security, surveillance, and immersive reality. This study explores the potential of a wearable camera for object detection and human–object interaction recognition, a key technology for the future Internet and ubiquitous computing. We propose a system that uses an egocentric camera view to recognize objects and human–object interactions by analyzing the wearer’s hand pose. The approach uses the wearer’s hand joint data, extracted from the egocentric camera view, to recognize different objects and the related interactions. Traditional methods for human–object interaction recognition rely on a third-person, i.e., exocentric, camera view and extract morphological and color/texture-related features; they therefore often fall short when faced with occlusion, camera variations, and background clutter. Deep learning-based approaches, in turn, require substantial training data, which leads to significant computational overhead. By relying on hand joint data captured from an egocentric perspective, the proposed approach offers a robust solution to these limitations of traditional methods. We present a machine learning-based technique for feature extraction and description from 3D hand joint data, with two distinct approaches: object-dependent and object-independent interaction recognition. The proposed method offers advantages in computational efficiency over deep learning methods and was validated on the publicly available HOI4D dataset, where it achieved a best-case average F1-score of 74%. The proposed system paves the way for intuitive human–computer collaboration within the future Internet, enabling applications such as seamless object manipulation and natural user interfaces for smart devices, human–robot interaction, virtual reality, and augmented reality.

Funder

School of Information Technology, Whitecliffe, Wellington, New Zealand

Air University, Islamabad, Pakistan

Publisher

MDPI AG

Link

https://www.mdpi.com/1999-5903/16/8/269/pdf
