Abstract
RGB and depth modalities carry abundant, complementary information, and convolutional neural networks (ConvNets) trained on such multi-modal data have made substantial progress in action recognition. However, a single-stream network struggles to learn the interactive features across modalities that would further improve recognition performance. Inspired by multi-stream learning mechanisms and spatial-temporal information representation methods, we construct dynamic images with the rank pooling method and design an interactive learning dual-ConvNet (ILD-ConvNet) with a multiplexer module to improve action recognition. Built on rank pooling, the constructed dynamic images capture the spatial-temporal information of entire RGB videos. We extend this construction to depth sequences to obtain richer multi-modal spatial-temporal inputs for the ConvNets. In addition, the multiplexer modules allow the two streams of the ILD-ConvNet to jointly learn interactive features from the RGB and depth modalities. The proposed recognition framework has been tested on two benchmark multi-modal datasets, NTU RGB+D 120 and PKU-MMD. With a temporal segmentation mechanism, the proposed ILD-ConvNet achieves accuracies of 86.9% (Cross-Subject, C-Sub) and 89.4% (Cross-Setup, C-Set) on NTU RGB+D 120, and 92.0% (Cross-Subject, C-Sub) and 93.1% (Cross-View, C-View) on PKU-MMD, which are comparable with the state of the art. The experimental results show that the proposed ILD-ConvNet with a multiplexer module can extract interactive features from different modalities to enhance action recognition performance.
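The dynamic images described in the abstract are typically produced by collapsing a frame sequence into a single image whose pixels encode temporal evolution. A minimal sketch of this idea, using the closed-form approximate rank pooling coefficients (alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), an assumption based on the standard rank-pooling literature rather than the exact pipeline of this paper), can be written in a few lines of NumPy; the function name and normalization choices are illustrative only:

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a video of shape (T, H, W, C) into one dynamic image.

    Coefficients follow the approximate rank pooling form
    alpha_t = sum_{q=t}^{T} (2q - T - 1) / q, which assigns negative
    weights to early frames and positive weights to late frames, so the
    weighted sum encodes the temporal evolution of appearance.
    (Illustrative sketch; the paper's exact construction may differ.)
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # alpha_t for 1-indexed frame t
    alphas = np.array([
        sum((2.0 * q - T - 1.0) / q for q in range(t, T + 1))
        for t in range(1, T + 1)
    ])
    # weighted sum over the time axis -> (H, W, C)
    di = np.tensordot(alphas, frames, axes=1)
    # rescale to [0, 255] so the result can be fed to an image ConvNet
    di = (di - di.min()) / (np.ptp(di) + 1e-8) * 255.0
    return di.astype(np.uint8)
```

The same function applies unchanged to depth sequences (with C = 1), which is how a single construction can supply both streams of a dual-ConvNet with spatially registered, temporally pooled inputs.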
Funder
Ministry of Science and Technology of China
National Natural Science Foundation of China
Key Projects of Artificial Intelligence of High School in Guangdong Province
Innovation Project of High School in Guangdong Province
Dongguan Science and Technology Special Commissioner Project
Dongguan Social Development Science and Technology Project
Subject
General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)