Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition

Published: 2022-06-18
Volume: 12
Issue: 12
Page: 6215
ISSN: 2076-3417
Journal: Applied Sciences
Language: en

Author:

Guo Zizhao, Ying Sancong

Abstract

Incorporating multi-modality data is an effective way to improve action recognition performance. Building on this idea, we investigate a new data modality in which Whole-Body Keypoint and Skeleton (WKS) labels are used to capture refined body information. Rather than aggregating modalities directly, we use distillation to transfer the feature-extraction ability of the WKS network to an RGB network, so that the RGB network classifies actions from RGB clips alone. Inspired by the success of transformers in vision tasks, we design an architecture that combines three-dimensional (3D) convolutional neural networks (CNNs) with the Swin transformer to extract spatiotemporal features and improve performance. Furthermore, because the clips of a video are not equally discriminative, we present a new method for aggregating clip-level classification results, which further improves performance. Experimental results show that our framework achieves an accuracy of 93.4% on the UCF-101 dataset with only RGB input.
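
The abstract does not say how the clip-level results are aggregated, so the following is only a rough illustration of the general idea that clips need not contribute equally to the video-level prediction. It is not the authors' method: the weighting rule (each clip weighted by its top-class probability) and the function names `aggregate_clips` and `uniform_average` are assumptions made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_average(clip_logits):
    """Baseline: plain mean of per-clip class probabilities."""
    probs = [softmax(l) for l in clip_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def aggregate_clips(clip_logits):
    """Confidence-weighted mean of per-clip class probabilities.

    Assumption for this sketch: a clip's weight is its highest class
    probability, so confident clips count more than ambiguous ones.
    """
    if not clip_logits:
        raise ValueError("need at least one clip")
    probs = [softmax(l) for l in clip_logits]
    weights = [max(p) for p in probs]
    total = sum(weights)
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total
        for c in range(num_classes)
    ]


# Hypothetical scores for three clips of one video over three classes:
# the first clip is confident, the other two are nearly uninformative.
example_logits = [[4.0, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.2, 0.1]]
video_probs = aggregate_clips(example_logits)
predicted_class = max(range(len(video_probs)), key=video_probs.__getitem__)
```

Compared with `uniform_average`, this weighting moves the video-level distribution toward the prediction of the most confident clip. Other weightings, such as entropy-based or learned ones, would fit the same structure.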

Funder

the Major Special 427 Science and Technology Project of Sichuan Province

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/12/12/6215/pdf

References (52 articles)

1. 3D Convolutional Neural Networks for Human Action Recognition

2. Two-stream convolutional networks for action recognition in videos; Simonyan; arXiv, 2014

3. Quo vadis, action recognition? A new model and the Kinetics dataset; Carreira; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

4. Representing videos as discriminative sub-graphs for action recognition; Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

5. TDN: Temporal difference networks for efficient action recognition; Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

Cited by 1 article.

1. Manifolds-Based Low-Rank Dictionary Pair Learning for Efficient Set-Based Video Recognition; Applied Sciences; 2023-05-23