Affiliation:
1. College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen 518000, China
Abstract
Unlike many other computer vision tasks, action recognition must process large-scale video data; extracting and analyzing the informative parts of this huge volume of video information is the main difficulty of action recognition. In recent years, the outstanding performance of Graph Convolutional Networks (GCNs) in many fields has opened a new route for action recognition algorithms. However, in current GCN models the fixed physical adjacency matrix makes it difficult to mine synergistic relationships between key points that are not directly connected in physical space. Additionally, simply connecting skeleton data from different frames in time makes every frame contribute equally to the recognition result, which makes it harder to distinguish the stages of an action. In this paper, the information extraction ability of the model is optimized in the spatial and temporal domains, respectively. In the spatial domain, an Adjacency Matrix Generation (AMG) module is proposed that pre-analyzes node sets and generates an adaptive adjacency matrix; this adaptive matrix helps the graph convolution model extract the synergistic information between key points that is crucial for recognition. In the temporal domain, a Time Domain Attention (TDA) mechanism is designed that computes a time-domain weight vector through two pooling channels and reweights the key-point sequences accordingly. The performance of the resulting TDA-AMG-GCN model is verified on the NTU-RGB+D dataset, where it reaches accuracies of 84.5% and 89.8% on the cross-subject (CS) and cross-view (CV) splits, respectively, exceeding the average level of other commonly used detection methods.
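The abstract does not give the exact formulation of the AMG and TDA modules, so the sketch below is illustrative only: it follows common practice for adaptive skeleton graphs (embedding-based adjacency in the style of 2s-AGCN) and for double-pooling attention (CBAM-style average/max pooling collapsed onto the temporal axis). All module and parameter names here are hypothetical, and tensors follow the usual (batch, channels, frames, joints) layout for NTU-RGB+D skeletons.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDomainAttention(nn.Module):
    """Illustrative TDA: average- and max-pooling over the channel and
    joint axes give two per-frame descriptors ("double pooling channels"),
    and a small 1D convolution turns them into per-frame weights."""
    def __init__(self, kernel_size=9):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                     # x: (N, C, T, V)
        avg = x.mean(dim=(1, 3))              # (N, T) average-pooling channel
        mx = x.amax(dim=(1, 3))               # (N, T) max-pooling channel
        w = torch.stack([avg, mx], dim=1)     # (N, 2, T)
        w = torch.sigmoid(self.conv(w))       # (N, 1, T) per-frame weights
        return x * w.unsqueeze(-1)            # reweight every frame

class AdaptiveAdjacency(nn.Module):
    """Illustrative AMG: similarity between learned node embeddings yields
    a data-dependent adjacency matrix that can link joints with no
    physical bone between them."""
    def __init__(self, in_channels, embed_channels=16):
        super().__init__()
        self.theta = nn.Conv2d(in_channels, embed_channels, 1)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1)

    def forward(self, x):                     # x: (N, C, T, V)
        q = self.theta(x).mean(2)             # (N, E, V) time-averaged embeddings
        k = self.phi(x).mean(2)               # (N, E, V)
        A = torch.einsum('nev,new->nvw', q, k)  # (N, V, V) pairwise affinity
        return F.softmax(A, dim=-1)           # row-normalized adaptive adjacency

# Smoke test with NTU-style tensors: 25 joints, 50 frames, 64 channels.
x = torch.randn(2, 64, 50, 25)
print(TimeDomainAttention()(x).shape)         # torch.Size([2, 64, 50, 25])
print(AdaptiveAdjacency(64)(x).shape)         # torch.Size([2, 25, 25])
```

In a full model, the generated adjacency would typically be added to (or blended with) the fixed physical skeleton graph inside each graph convolution layer, while the temporal attention is applied to the layer's output before temporal convolution; the actual TDA-AMG-GCN wiring may differ.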
Funder
National Natural Science Foundation of China
Subject
Physics and Astronomy (miscellaneous), General Mathematics, Chemistry (miscellaneous), Computer Science (miscellaneous)