Hierarchical dynamic depth projected difference images–based action recognition in videos with convolutional neural networks

Published: 2019-01-01; Volume: 16; Issue: 1; Page: 172988141882509
ISSN: 1729-8814
Journal: International Journal of Advanced Robotic Systems
Language: English

Authors:

Wu Hanbo¹, Ma Xin¹, Li Yibin¹

Affiliation:

1. School of Control Science and Engineering, Shandong University, Jinan, China

Abstract

Temporal information plays a significant role in video-based human action recognition, and effectively extracting the spatial–temporal characteristics of actions in videos remains a challenging problem. Most existing methods acquire spatial and temporal cues from videos separately. In this article, we propose a new representation for depth video sequences, called hierarchical dynamic depth projected difference images, that aggregates the spatial and temporal information of an action simultaneously at different temporal scales. We first project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion of human actions. In each projected view, hierarchical dynamic depth projected difference images are then constructed with rank pooling to hierarchically encode the spatial–temporal motion dynamics of the depth video. Convolutional neural networks (CNNs) automatically learn discriminative features from images and, because of their strong performance, have been extended to video classification. To verify the effectiveness of the proposed representation, we build an action recognition framework in which the images from the three views are fed independently into three identical pretrained CNNs for fine-tuning. We design three classification schemes within this framework; each scheme uses different CNN layers so that their effects on action recognition can be compared, and each combines the three views to describe actions more comprehensively. The framework is evaluated on three challenging public human action data sets. Experiments indicate that our method performs better than the compared methods and provides discriminative spatial–temporal information for human action recognition in depth videos.
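The construction described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation, and several details are assumptions: the side and top projections are binary occupancy maps over a hypothetical number of depth bins (`z_bins`, with depths assumed in millimetres up to `z_max`); "difference images" are taken to be absolute differences between consecutive projected frames; rank pooling is replaced by the closed-form approximate rank pooling of Bilen et al. (dynamic image networks) rather than a full RankSVM fit; and "hierarchical" is read as a temporal pyramid in which level l splits the clip into 2**l segments. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[List[float]]  # H x W image stored as nested lists


def project_three_views(depth: Frame, z_bins: int, z_max: float) -> Tuple[Frame, Frame, Frame]:
    """Project one depth map onto front (x-y), side (y-z) and top (z-x) planes.

    Assumption: side/top views are binary occupancy maps over z_bins depth bins.
    """
    h, w = len(depth), len(depth[0])
    front = [row[:] for row in depth]            # front view is the depth map itself
    side = [[0.0] * z_bins for _ in range(h)]    # rows = y, columns = depth bin
    top = [[0.0] * w for _ in range(z_bins)]     # rows = depth bin, columns = x
    for y in range(h):
        for x in range(w):
            d = depth[y][x]
            if d <= 0:  # 0 marks a missing depth reading
                continue
            z = min(int(d / z_max * z_bins), z_bins - 1)
            side[y][z] = 1.0
            top[z][x] = 1.0
    return front, side, top


def abs_diff(a: Frame, b: Frame) -> Frame:
    """Pixel-wise absolute difference of two equally sized frames."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def approx_rank_pool(seq: List[Frame]) -> Frame:
    """Approximate rank pooling: sum_t alpha_t * frame_t, with
    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) and H_t the t-th harmonic number."""
    T = len(seq)
    harmonic = [0.0]
    for t in range(1, T + 1):
        harmonic.append(harmonic[-1] + 1.0 / t)
    h, w = len(seq[0]), len(seq[0][0])
    out = [[0.0] * w for _ in range(h)]
    for t in range(1, T + 1):
        alpha = 2 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
        frame = seq[t - 1]
        for y in range(h):
            for x in range(w):
                out[y][x] += alpha * frame[y][x]
    return out


def hierarchical_dynamic_images(frames: List[Frame], levels: int) -> List[List[Frame]]:
    """Level l splits the difference sequence into 2**l segments and rank-pools each one."""
    diffs = [abs_diff(frames[i + 1], frames[i]) for i in range(len(frames) - 1)]
    n = len(diffs)
    pyramid = []
    for level in range(levels):
        parts = 2 ** level
        images = []
        for s in range(parts):
            segment = diffs[s * n // parts:(s + 1) * n // parts]
            if segment:  # skip empty segments when the clip is very short
                images.append(approx_rank_pool(segment))
        pyramid.append(images)
    return pyramid


def dynamic_projected_difference_images(video: List[Frame], levels: int = 2, z_bins: int = 8,
                                        z_max: float = 4000.0) -> Dict[str, List[List[Frame]]]:
    """Build one image pyramid per projected view for a list of depth frames."""
    views = [project_three_views(f, z_bins, z_max) for f in video]
    return {
        name: hierarchical_dynamic_images([v[i] for v in views], levels)
        for i, name in enumerate(("front", "side", "top"))
    }
```

For example, `dynamic_projected_difference_images(video, levels=2)` on a five-frame clip returns, for each of `"front"`, `"side"` and `"top"`, one image at level 0 and two at level 1. A property worth noting for this approximation: the coefficients alpha_t sum to zero, so a segment with constant (or absent) motion pools to an all-zero image, while motion that grows over time yields positive values. Feeding each view's images to a pretrained CNN, as the framework does, is not sketched here.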

Funder

National Natural Science Foundation of China

Publisher

SAGE Publications

Subject

Artificial Intelligence, Computer Science Applications, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1729881418825093

References: 52 articles.

Cited by 13 articles.

1. Spatio-Temporal Information Fusion and Filtration for Human Action Recognition. Symmetry, 2023-12-08.

2. Spatial and temporal information fusion for human action recognition via Center Boundary Balancing Multimodal Classifier. Journal of Visual Communication and Image Representation, 2023-02.

3. PointMapNet: Point Cloud Feature Map Network for 3D Human Action Recognition. Symmetry, 2023-01-30.

4. VirtualActionNet: A strong two-stream point cloud sequence network for human action recognition. Journal of Visual Communication and Image Representation, 2022-11.

5. Real-time human action recognition using raw depth video-based recurrent neural networks. Multimedia Tools and Applications, 2022-10-28.