Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Authors:

Sun Baoli¹, Ye Xinchen¹, Yan Tiantian², Wang Zhihui¹, Li Haojie³, Wang Zhiyong⁴

Affiliations:

1. Dalian University of Technology, Dalian, China

2. Dalian University, Dalian, China

3. Shandong University of Science and Technology, Qingdao, China

4. The University of Sydney, Sydney, Australia

Abstract

Fine-grained video action recognition aims to identify subtle yet discriminative variations among fine-grained categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, they neglect how to model the interactions among discriminative atomic actions so as to effectively characterize inter-class and intra-class variations, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) that mines the discriminability of segment correlations and localizes discriminative action-relevant segments for fine-grained video action recognition. First, we propose a hierarchical correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting its correlations with other segments. Second, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing consistency between the discriminability and the classification confidence of a given segment via a consistency constraint. Finally, the localized segment representations are combined with the global action representation of the whole video to boost the final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with state-of-the-art methods.
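A minimal PyTorch sketch of the pipeline the abstract describes may help make the three stages concrete. Only the overall structure (HCR enhancement, DSF segment selection under a consistency constraint, fusion with the global representation) comes from the abstract; the multi-head-attention realization of correlation reasoning, the scoring and classification heads, the top-k selection, and the MSE form of the consistency loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HCR(nn.Module):
    """Hierarchical correlation reasoning (sketch): enhance each segment by
    attending to the other segments at several pooled temporal scales."""

    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for _ in scales
        ])

    def forward(self, seg):                        # seg: (B, T, C) segment features
        out = seg
        for s, attn in zip(self.scales, self.attn):
            # Average-pool neighbouring segments to form a coarser temporal scale.
            pooled = F.avg_pool1d(out.transpose(1, 2), kernel_size=s,
                                  stride=s, ceil_mode=True).transpose(1, 2)
            ctx, _ = attn(out, pooled, pooled)     # cross-scale correlations
            out = out + ctx                        # residual enhancement
        return out


class DSFNet(nn.Module):
    """DSFNet sketch: HCR enhancement, discriminative segment focus (DSF),
    and fusion of focused segments with the global video representation."""

    def __init__(self, dim, num_classes, topk=3):
        super().__init__()
        self.hcr = HCR(dim)
        self.score = nn.Linear(dim, 1)              # per-segment discriminability
        self.seg_cls = nn.Linear(dim, num_classes)  # per-segment classifier
        self.cls = nn.Linear(dim, num_classes)      # final classifier
        self.topk = topk

    def forward(self, seg, labels=None):            # seg: (B, T, C)
        seg = self.hcr(seg)
        disc = self.score(seg).squeeze(-1)          # (B, T) discriminability
        k = min(self.topk, seg.size(1))
        idx = disc.topk(k, dim=1).indices           # most action-relevant segments
        focus = torch.gather(
            seg, 1, idx.unsqueeze(-1).expand(-1, -1, seg.size(-1)))
        fused = seg.mean(1) + focus.mean(1)         # global + focused features
        logits = self.cls(fused)

        loss = None
        if labels is not None:
            seg_logits = self.seg_cls(seg)          # (B, T, num_classes)
            # Classification confidence of the ground-truth class per segment.
            conf = seg_logits.softmax(-1).gather(
                2, labels.view(-1, 1, 1).expand(-1, seg.size(1), 1)).squeeze(-1)
            # Consistency constraint (assumed MSE): a segment's discriminability
            # should agree with its classification confidence.
            loss = (F.cross_entropy(logits, labels)
                    + F.mse_loss(disc.softmax(-1), conf.softmax(-1)))
        return logits, loss
```

In use, a video backbone (e.g., a 3D CNN or a video transformer) would supply the (B, T, C) per-segment features, with C divisible by the number of attention heads; training would then minimize the returned joint loss.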

Publisher

Association for Computing Machinery (ACM)

