Authors:
Zhang Cheng, Zhong Jianqi, Cao Wenming, Ji Jianhua
Abstract
Unsupervised action recognition based on spatiotemporal fusion feature extraction has attracted much attention in recent years. However, existing methods still have several limitations: (1) long-term dependencies are not effectively extracted in the temporal dimension; (2) high-order motion relationships between non-adjacent nodes are not effectively captured in the spatial dimension; (3) model complexity becomes too high when the input sequence to the cascaded layers is long or there are many key points. To solve these problems, this paper proposes a Multiple Distilling-based spatial-temporal attention (MD-STA) network, which extracts temporal and spatial features separately and then fuses them. Specifically, we first propose a Screening Self-attention (SSA) module, which finds long-term dependencies across distant frames and high-order motion patterns between non-adjacent nodes within a single frame by applying a sparsity metric to dot-product pairs. Then, we propose the Frames and Keypoint-Distilling (FKD) module, which uses extraction operations to halve the input to each cascaded layer, eliminating uninformative key points and frame features and thus reducing time and memory complexity. Finally, the Dim-reduction Fusion (DRF) module reduces the dimension of the resulting features to further eliminate redundancy. Extensive experiments on three datasets, NTU-60, NTU-120, and UWA3D, show that MD-STA achieves state-of-the-art performance in skeleton-based unsupervised action recognition.
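The abstract gives no formulas for the SSA and FKD modules, so the following is only a rough sketch of the two ideas it describes: screening queries with a sparsity metric on dot-product pairs, and halving the input passed to the next layer. The max-minus-mean score, the stride-2 max pooling, and every function name and tensor size below are assumptions made for illustration; they are not taken from the paper and may differ from the authors' design.

```python
import numpy as np

def sparsity_scores(q, k):
    """Assumed metric: max minus mean of each query's scaled dot products."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (n_q, n_k)
    return scores.max(axis=1) - scores.mean(axis=1)

def screened_attention(q, k, v, n_keep):
    """Softmax attention for the n_keep highest-scoring queries only.

    All other queries fall back to the mean of v (an assumed default).
    """
    n_q = q.shape[0]
    top = np.argsort(sparsity_scores(q, k))[-n_keep:]
    out = np.tile(v.mean(axis=0), (n_q, 1))
    s = q[top] @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ v
    return out

def distill_half(x):
    """Halve the first axis by max pooling over non-overlapping pairs.

    A trailing unpaired row is dropped.
    """
    n = x.shape[0] - x.shape[0] % 2
    return x[:n].reshape(n // 2, 2, *x.shape[1:]).max(axis=1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((64, 32))             # toy input: 64 frames, 32-dim features
attended = screened_attention(frames, frames, frames, n_keep=16)
halved = distill_half(attended)
print(attended.shape, halved.shape)                # (64, 32) (32, 32)
```

In this toy setup the same two functions could be applied along the keypoint axis instead of the frame axis. Because each halving step shrinks the input to the next layer, the cost of the pairwise dot products in that layer drops by roughly a factor of four.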
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Theoretical Computer Science