Affiliation:
1. School of Computer Science and Engineering, University of New South Wales, Australia
2. School of Computing and Information Systems, University of Melbourne, Australia
3. Faculty of Engineering and Information Technology, University of Technology Sydney, Australia
4. Data61, CSIRO and School of Computer Science and Engineering, University of New South Wales, Australia
Abstract
As a fundamental aspect of human life, two-person interactions carry meaningful information about people's activities, relationships, and social settings. Human action recognition serves as the foundation for many smart applications, where personal privacy is a strong concern. However, recognizing two-person interactions is more challenging than recognizing single-person actions because of increased body occlusion and overlap. In this article, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses this challenge by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. A frame feature learning module and a two-stream multi-level feature aggregation module then extract global and partial features from the sampled frames, effectively representing the local-region spatial, appearance, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120, show that our network outperforms state-of-the-art approaches under most standard evaluation settings.
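The abstract does not spell out how Interval Frame Sampling works internally, but a common form of interval-based sampling divides a video into equal-length intervals and draws one frame from each, which keeps coverage of the whole clip while bounding the number of frames processed. The sketch below illustrates that idea under this assumption; the function name and its behavior are hypothetical and are not taken from the paper.

```python
import random

def interval_frame_sampling(num_frames: int, num_samples: int, seed: int = 0) -> list:
    """Hypothetical sketch of interval-based frame sampling: split a clip of
    num_frames into num_samples equal intervals and pick one random frame
    index from each, so the samples span the whole video."""
    assert num_frames >= num_samples > 0
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    # Interval boundaries, evenly spaced over [0, num_frames)
    bounds = [round(i * num_frames / num_samples) for i in range(num_samples + 1)]
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # One random frame per interval; intervals are disjoint, so the
        # returned indices are strictly increasing in time.
        indices.append(rng.randrange(start, end))
    return indices
```

Because each index comes from its own disjoint interval, the sampled frames are temporally ordered and spread across the clip, rather than clustered as uniform random sampling can produce.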
Publisher
Association for Computing Machinery (ACM)
References: 51 articles.
1. Gedas Bertasius. 2021. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 18-24 July 2021, Virtual Event.
2. Hierarchical transfer learning for online recognition of compound actions
3. Shian-Yu Chiu, Kun-Ru Wu, and Yu-Chee Tseng. 2021. Two-person mutual action recognition using joint dynamics and coordinate transformation. In Proceedings of the 1st International Conference on AI for People: Towards Sustainable AI (CAIP 2021), 20-24 November 2021. European Alliance for Innovation, 56.
4. Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition
5. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR'21, Virtual Event, Austria, May 3-7, 2021). OpenReview.net.