Abstract
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
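To make the two ideas the abstract refers to concrete, the sketch below shows (a) a cross-stream "conv fusion" module that mixes appearance and motion feature maps, and (b) gradient ascent on synthetic RGB and optical-flow inputs to maximise the activation of one fused feature channel. This is a minimal PyTorch sketch rather than the authors' released code; the names ConvFusion, maximise_unit, unit, steps and lr are illustrative assumptions, and the regularisation terms a practical visualisation would need are omitted.

    import torch
    import torch.nn as nn

    class ConvFusion(nn.Module):
        """Fuse appearance and motion feature maps with a learned 1x1 convolution."""
        def __init__(self, channels):
            super().__init__()
            # Stack the two streams along the channel axis, then let a 1x1
            # convolution learn correspondences between them.
            self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, appearance, motion):
            return self.mix(torch.cat([appearance, motion], dim=1))

    def maximise_unit(two_stream_model, unit, rgb_shape=(1, 3, 224, 224),
                      flow_shape=(1, 2, 224, 224), steps=200, lr=0.05):
        """Gradient ascent on the inputs so one fused feature channel fires strongly."""
        # Hypothetical two_stream_model: any module mapping (rgb, flow) to a
        # fused feature map of shape (N, C, H, W).
        rgb = torch.randn(rgb_shape, requires_grad=True)
        flow = torch.randn(flow_shape, requires_grad=True)
        optimiser = torch.optim.Adam([rgb, flow], lr=lr)
        for _ in range(steps):
            optimiser.zero_grad()
            feature_map = two_stream_model(rgb, flow)
            loss = -feature_map[:, unit].mean()   # maximise the chosen channel
            loss.backward()
            optimiser.step()
        return rgb.detach(), flow.detach()

Inspecting the optimised rgb and flow tensors, alongside the training clips that most strongly activate the same unit, is the kind of analysis that reveals the class-specific versus generic detectors discussed above.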
Funder
Graz University of Technology
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
14 articles.