MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Published: 2021-05-18 Issue: 4 Volume: 35 Page: 2943-2951
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Author:

Wu Wenhao, He Dongliang, Lin Tianwei, Li Fu, Gan Chuang, Ding Errui

Abstract

Conventionally, spatiotemporal modeling and its computational complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness. First, besides the traditional treatment of H × W × T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture video dynamics thoroughly. Second, our model is built on 2D CNN backbones, with model complexity kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module that can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that specializes to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments on popular benchmarks (Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) demonstrate its superiority. The proposed MVFNet achieves state-of-the-art performance with the complexity of a 2D CNN.
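The multi-view idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes, for illustration only, that the separable convolution is realized as three channel-wise 3-tap filters, one along each of the T, H, and W axes, applied to a fraction `alpha` of the channels and summed, while the remaining channels pass through unchanged. The names `conv_along_axis`, `multi_view_fusion`, and `alpha` are hypothetical and do not come from the abstract.

```python
import numpy as np


def conv_along_axis(x, kernel, axis):
    """Channel-wise 3-tap filter along one axis of x, with zero padding.

    x:      array of shape (C, T, H, W)
    kernel: array of shape (C, 3), one 3-tap filter per channel
    axis:   1 (time), 2 (height), or 3 (width)
    """
    n = x.shape[axis]
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (1, 1)
    xp = np.pad(x, pad_width)  # zero padding keeps the output size equal to n
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        # Per-channel weight, broadcast over the T, H, and W axes.
        w = kernel[:, i].reshape(-1, 1, 1, 1)
        out += w * np.take(xp, np.arange(i, i + n), axis=axis)
    return out


def multi_view_fusion(x, k_t, k_h, k_w, alpha=0.5):
    """Illustrative multi-view fusion over a (C, T, H, W) feature map.

    The first int(C * alpha) channels are filtered along T, H, and W and the
    three results are summed; the other channels are left untouched
    (assumed design, labeled hypothetical above).
    """
    c_mv = int(x.shape[0] * alpha)
    if c_mv == 0:
        # No channels are fused: the module reduces to an identity, so the
        # surrounding 2D CNN behaves like a plain frame-wise model.
        return x.astype(float).copy()
    part, rest = x[:c_mv], x[c_mv:]
    fused = (
        conv_along_axis(part, k_t[:c_mv], axis=1)
        + conv_along_axis(part, k_h[:c_mv], axis=2)
        + conv_along_axis(part, k_w[:c_mv], axis=3)
    )
    return np.concatenate([fused, rest], axis=0)
```

Under these assumptions, setting `alpha=0` leaves the features unchanged, and using a fixed temporal kernel such as `[1, 0, 0]` with zero H and W kernels shifts the selected channels by one frame. This hints at how a single module could reduce to simpler existing designs under different settings, as the abstract claims.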

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 32 articles.

1. Variable Temporal Length Training for Action Recognition CNNs; Sensors; 2024-05-25

2. Short-Term Action Learning for Video Action Recognition; IEEE Access; 2024

3. Multi-scale Motion Feature Integration for Action Recognition; 2023 9th International Conference on Computer and Communications (ICCC); 2023-12-08

4. Making TSM better: Preserving foundational philosophy for efficient action recognition; ICT Express; 2023-12

5. Slow-Fast Time Parameter Aggregation Network for Class-Incremental Lip Reading; Proceedings of the 31st ACM International Conference on Multimedia; 2023-10-26