Affiliation:
1. School of Automation, Beijing Institute of Technology, Beijing 100081, China
Abstract
Recently, capsule networks have emerged as a novel neural network architecture for human motion recognition, owing to their enhanced interpretability compared with traditional deep learning networks. However, the characteristic features of human motion are often distributed across distinct spatial dimensions, and existing capsule networks struggle to independently extract and combine features across multiple spatial dimensions. In this paper, we propose a new multi-channel capsule network architecture that extracts feature capsules in different spatial dimensions, forms a multi-channel capsule chain with independent routing within each channel, and finally aggregates information from the capsules in all channels to activate categories. The proposed structure enables the network to cluster interpretable features independently within each channel, to aggregate features across channels during classification, thereby improving classification accuracy and robustness, and to mine interpretable primitives within individual channels. Experimental comparisons with several existing capsule network structures demonstrate the superior performance of the proposed architecture. Furthermore, in contrast to previous studies that discussed the interpretability of capsule networks only vaguely, we include additional visual experiments that illustrate the interpretability of the proposed network structure in practical scenarios.
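To make the described architecture concrete, the following is a minimal sketch of the multi-channel idea: each channel routes its own capsules independently (here using the dynamic routing of Sabour et al., 2017), and the per-channel class capsules are fused before category activation. PyTorch, all layer sizes, the three-channel setup, and the summation-based fusion are illustrative assumptions, not the paper's published implementation.

```python
# Sketch of a multi-channel capsule network: independent routing per channel,
# cross-channel aggregation at the class-capsule level. Sizes are assumptions.
import torch
import torch.nn as nn


def squash(s, dim=-1, eps=1e-8):
    """Capsule non-linearity: shrinks short vectors toward 0, long toward 1."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)


class RoutingCapsules(nn.Module):
    """One capsule channel with its own transforms and routing iterations."""

    def __init__(self, in_caps, in_dim, out_caps, out_dim, iters=3):
        super().__init__()
        self.iters = iters
        # One transformation matrix per (output capsule, input capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(1, out_caps, in_caps, out_dim, in_dim))

    def forward(self, u):                      # u: (B, in_caps, in_dim)
        u = u[:, None, :, None, :]             # (B, 1, in_caps, 1, in_dim)
        u_hat = (self.W * u).sum(-1)           # predictions: (B, out_caps, in_caps, out_dim)
        b = torch.zeros_like(u_hat[..., 0])    # routing logits: (B, out_caps, in_caps)
        for _ in range(self.iters):
            c = b.softmax(dim=1)               # couplings over output capsules
            v = squash((c[..., None] * u_hat).sum(dim=2))   # (B, out_caps, out_dim)
            b = b + (u_hat * v[:, :, None, :]).sum(-1)      # agreement update
        return v


class MultiChannelCapsNet(nn.Module):
    """Independent capsule chains per channel, fused at the class level."""

    def __init__(self, n_channels=3, in_caps=32, in_dim=8, n_classes=10, class_dim=16):
        super().__init__()
        self.channels = nn.ModuleList(
            [RoutingCapsules(in_caps, in_dim, n_classes, class_dim)
             for _ in range(n_channels)]
        )

    def forward(self, caps_per_channel):       # list of (B, in_caps, in_dim) tensors
        # Route each channel independently, then aggregate class capsules
        # across channels (here: squash of the sum) to activate categories.
        class_caps = [chan(u) for chan, u in zip(self.channels, caps_per_channel)]
        fused = squash(torch.stack(class_caps, dim=0).sum(dim=0))
        return fused.norm(dim=-1)              # class scores ~ capsule lengths


if __name__ == "__main__":
    net = MultiChannelCapsNet()
    x = [torch.randn(4, 32, 8) for _ in range(3)]  # toy primary capsules per channel
    print(net(x).shape)                            # -> torch.Size([4, 10])
```

In this sketch the per-channel class capsules are summed and re-squashed; other fusion rules (concatenation, learned weighting) would fit the same skeleton.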
Funder
Key Technology Research and Demonstration of National Scientific Training Base Construction of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
References (25 articles)
1. Shih, H.-C. (2018). A survey of content-aware video analysis for sports. IEEE Trans. Circuits Syst. Video Technol.
2. Shi, Y., Tian, Y., Wang, Y., and Huang, T. (2017). Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans. Multimed.
3. Simonyan, K., and Zisserman, A. (2014, December 8–13). Two-stream convolutional networks for action recognition in videos. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada.
4. Wang, H., and Schmid, C. (2013, December 3–6). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.
5. He, K., Zhang, X., Ren, S., and Sun, J. (2016, June 27–30). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.