Abstract
Action recognition requires accurate analysis of both the action elements within a video clip and the properly ordered sequence of those elements. Solving these two sub-problems requires learning spatio-temporal information as well as the temporal relationship between different action elements. Existing convolutional neural network (CNN)-based action recognition methods focus on learning only spatial or temporal information, without considering the temporal relations between action elements. In this paper, we create short-term pixel-difference images from the input video and feed them into a bidirectional exponential moving average sub-network to analyze the action elements and their temporal relations. The proposed method consists of: (i) generation of RGB and differential images, (ii) extraction of deep feature maps using an image classification sub-network, (iii) weight assignment to the extracted feature maps using a bidirectional exponential moving average sub-network, and (iv) late fusion with a three-dimensional convolutional (C3D) sub-network to improve action recognition accuracy. Experimental results show that the proposed method outperforms existing baseline methods. In addition, the proposed action recognition network takes only 0.075 seconds per action class, which enables high-speed and real-time applications such as abnormal action classification, human–computer interaction, and intelligent visual surveillance.
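The sketch below illustrates two of the steps named in the abstract: generating short-term pixel-difference images from consecutive frames and applying a bidirectional exponential moving average over per-frame features. It is a minimal NumPy illustration under stated assumptions, not the paper's actual sub-networks: the feature extractor is a stand-in (global mean pooling), and the smoothing factor `alpha` is chosen only for demonstration.

```python
# Minimal sketch; assumptions: frames as NumPy arrays, a toy feature
# extractor, and an illustrative alpha. Not the paper's architecture.
import numpy as np

def difference_images(frames):
    """Short-term pixel-difference images between consecutive RGB frames.

    frames: array of shape (T, H, W, 3) with float pixel values.
    Returns an array of shape (T - 1, H, W, 3).
    """
    return np.abs(np.diff(frames.astype(np.float32), axis=0))

def bidirectional_ema(features, alpha=0.3):
    """Bidirectional exponential moving average over per-frame features.

    features: array of shape (T, D), one feature vector per frame.
    alpha:    smoothing factor; larger values weight the current frame more.
    Averages a forward and a backward EMA pass, so each frame's weighted
    feature reflects both past and future temporal context.
    """
    T, _ = features.shape
    fwd = np.zeros_like(features)
    bwd = np.zeros_like(features)
    fwd[0] = features[0]
    bwd[-1] = features[-1]
    for t in range(1, T):
        fwd[t] = alpha * features[t] + (1 - alpha) * fwd[t - 1]
    for t in range(T - 2, -1, -1):
        bwd[t] = alpha * features[t] + (1 - alpha) * bwd[t + 1]
    return 0.5 * (fwd + bwd)

# Toy usage: a random "video" and a stand-in feature extractor.
rng = np.random.default_rng(0)
video = rng.random((16, 112, 112, 3))                               # 16 frames
diffs = difference_images(video)                                    # (15, 112, 112, 3)
feats = diffs.reshape(len(diffs), -1).mean(axis=1, keepdims=True)   # (15, 1) toy features
weighted = bidirectional_ema(np.tile(feats, (1, 8)))                # (15, 8)
print(diffs.shape, weighted.shape)
```

In the full method, the per-frame features would come from the image classification sub-network, and the weighted features would then be fused with the C3D branch at a late stage.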
Funder
Institute for Information & Communications Technology Promotion
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
Cited by
4 articles.