Affiliation:
1. CBSR & NLPR, Institute of Automation, Chinese Academy of Sciences
2. Macau University of Science and Technology
3. School of Engineering Science, University of Chinese Academy of Sciences
Abstract
In this article, we focus on isolated gesture recognition and explore different modalities, namely the RGB, depth, and saliency streams. Our goal is to push the boundary of this realm further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus voting is proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance in the face of large inter-class variations. In addition, a three-dimensional depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the contribution of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitor by margins of over 10% and 15%, respectively. Our project and code will be released at https://davidsonic.github.io/index/acm_tomm_2017.html.
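The consensus-voting idea described in the abstract can be illustrated with a minimal sketch. The exact fusion rule is not specified here, so the snippet below assumes the common formulation: per-segment class scores from a spatial-temporal network are averaged across segments, and the video-level label is the argmax of the fused scores. The function name `consensus_vote` and the averaging choice are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def consensus_vote(segment_scores):
    """Fuse per-segment class scores into one video-level prediction.

    segment_scores: shape (num_segments, num_classes), e.g. softmax
    outputs of a spatial-temporal network on snippets sampled along
    the video. Averaging acts as a vote and reduces estimation
    variance across segments (assumed fusion rule).
    """
    scores = np.asarray(segment_scores, dtype=float)
    video_scores = scores.mean(axis=0)
    return int(video_scores.argmax()), video_scores

# Toy example: 3 segments, 4 gesture classes
segs = [[0.1, 0.6, 0.2, 0.1],
        [0.2, 0.5, 0.2, 0.1],
        [0.3, 0.3, 0.3, 0.1]]
pred, fused = consensus_vote(segs)  # class 1 wins the vote
```

Averaging (rather than taking a single segment's prediction) is what lets the model integrate evidence over the whole sequence, which is the long-term structure modeling the abstract refers to.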
Funder
NVIDIA GPU donation program and AuthenMetric R&D Funds
National Key Research and Development Plan
Chinese National Natural Science Foundation Projects
Science and Technology Development Fund of Macau
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
41 articles.