HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Published: 2022-12-02
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Weng Zejia¹, Wu Zuxuan¹, Li Hengduo², Chen Jingjing¹, Jiang Yu-Gang¹

Affiliation:

1. Shanghai Key Lab of Intelligent Info. Processing, School of CS, Fudan University, China

2. Department of Computer Science, University of Maryland, USA

Abstract

Videos are multimodal in nature. Conventional video recognition pipelines typically fuse multimodal features for improved performance. However, this is not only computationally expensive but also neglects the fact that different videos rely on different modalities for predictions. This paper introduces Hierarchical and Conditional Modality Selection (HCMS), a simple yet efficient multimodal learning framework for video recognition. By default, HCMS operates on a low-cost modality, i.e., audio cues, and decides on the fly, on a per-input basis, whether to use computationally expensive modalities, including appearance and motion cues. This is achieved by the collaboration of three LSTMs organized in a hierarchical manner. In particular, each LSTM that operates on a high-cost modality contains a gating module, which takes lower-level features and historical information as inputs to adaptively determine whether to activate its corresponding modality; otherwise, the LSTM simply reuses historical information. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate that the proposed approach can effectively exploit multimodal information for improved classification performance while requiring much less computation.
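
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the LSTMs are replaced by a simple running average, the learned gating module is replaced by a hand-written threshold rule, and the names GatedLevel, extract, threshold, and run are hypothetical. It only shows the control flow, in which audio is always processed and each expensive level either pays for its modality or reuses its previous state.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[float]


def mean(v: Vector) -> float:
    return sum(v) / len(v)


@dataclass
class GatedLevel:
    """One expensive-modality level (hypothetical stand-in for a gated LSTM)."""
    extract: Callable[[int], Vector]  # assumed expensive feature extractor, indexed by time step
    threshold: float                  # assumed fixed threshold; the paper's gate is learned
    state: Vector = field(default_factory=lambda: [0.0, 0.0])
    calls: int = 0                    # how often the expensive modality was actually used

    def step(self, t: int, lower: Vector) -> Vector:
        # Hand-written gate (assumption): open when the lower-level features
        # disagree with this level's history by more than the threshold.
        if abs(mean(lower) - mean(self.state)) > self.threshold:
            self.calls += 1
            feat = self.extract(t)  # pay for the expensive modality
            # Running average used in place of an LSTM state update.
            self.state = [0.5 * s + 0.5 * f for s, f in zip(self.state, feat)]
        # Gate closed: the state is returned unchanged, i.e. history is reused.
        return self.state


def run(audio: List[Vector], appearance: GatedLevel, motion: GatedLevel) -> Vector:
    for t, a in enumerate(audio):
        app_state = appearance.step(t, a)  # audio features are always available
        motion.step(t, app_state)          # motion is gated on the level below
    return motion.state


# Toy input: the audio features change halfway through the clip.
audio_feats = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
appearance = GatedLevel(extract=lambda t: [1.0, 1.0], threshold=0.4)
motion = GatedLevel(extract=lambda t: [1.0, 1.0], threshold=0.4)
run(audio_feats, appearance, motion)
print(appearance.calls, motion.calls)  # prints "2 1"
```

On this toy input the appearance extractor runs on two of the four time steps and the motion extractor on one, which is the kind of per-input saving the abstract describes.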

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3572776

References: 76 articles.

1. Samah Aloufi and Abdulmotaleb El Saddik. 2022. MMSUM digital twins: a multi-view multi-modality summarization framework for sporting events. ACM TOMM (2022).

2. Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV.

3. Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017. Adaptive neural networks for fast test-time prediction. In ICML.

4. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.

5. Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.

Cited by 3 articles.

1. Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2024-07

2. Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-01-11

3. SMG: A System-Level Modality Gating Facility for Fast and Energy-Efficient Multimodal Computing; 2023 IEEE Real-Time Systems Symposium (RTSS); 2023-12-05