Attentional Composition Networks for Long-Tailed Human Action Recognition

Published:2023-08-24 Issue:1 Volume:20 Page:1-18
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
Language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Wang Haoran¹, Wang Yajie¹, Yu Baosheng², Zhan Yibing³, Yuan Chunfeng⁴, Yang Wankou⁵

Affiliation:

1. Northeastern University, China

2. The University of Sydney, Australia

3. JD Explore Academy, China

4. Chinese Academy of Sciences, China

5. Southeast University, China

Abstract

Long-tailed visual recognition has been receiving increasing research attention. However, the long-tailed distribution problem remains underexplored for video-based visual recognition. To address this issue, in this article we propose a compositional-learning-based solution for video-based human action recognition. Our method, named Attentional Composition Networks (ACN), first learns verb-like and preposition-like components, then shuffles these components to generate feature-space samples that augment the data for the tail classes. Specifically, during training, we represent each action video by a graph that captures the spatial-temporal relations (edges) among detected human/object instances (nodes). Then, ACN uses the position information to decompose each action into a set of verb and preposition representations using the edge features in the graph. After that, the verb and preposition features from different videos are combined via an attention structure to synthesize feature representations for tail classes. In this way, we enrich the data for the tail classes and consequently improve action recognition for these classes. To evaluate compositional human action recognition, we further contribute a new human action recognition dataset, namely NEU-Interaction (NEU-I). Experimental results on both Something-Something V2 and the proposed NEU-I demonstrate the effectiveness of the proposed method for long-tailed, few-shot, and zero-shot problems in human action recognition. Source code and the NEU-I dataset are available at https://github.com/YajieW99/ACN.
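
The composition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). It assumes plain scaled dot-product attention, a residual sum, and mean pooling as the fusion rule, and the names `compose`, `verb_feats`, and `prep_feats` are hypothetical. It only shows the general idea of combining verb-like features from one video with preposition-like features from another to form one synthetic feature vector.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose(verb_feats, prep_feats):
    """Fuse verb-like and preposition-like features into one vector.

    ASSUMPTION: the fusion rule here (scaled dot-product attention,
    residual sum, mean pooling) is illustrative only and is not taken
    from the paper.

    verb_feats: (n_v, d) array of verb-like features from one video
    prep_feats: (n_p, d) array of preposition-like features from another
    Returns a (d,) synthetic feature vector.
    """
    d = verb_feats.shape[1]
    # Each verb feature acts as a query over the preposition features.
    scores = verb_feats @ prep_feats.T / np.sqrt(d)  # (n_v, n_p)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    attended = weights @ prep_feats                  # (n_v, d)
    # Residual fusion, then pool over the verb features.
    return (verb_feats + attended).mean(axis=0)


# Hypothetical random features standing in for graph edge features.
rng = np.random.default_rng(0)
verb_from_video_a = rng.normal(size=(4, 8))
prep_from_video_b = rng.normal(size=(3, 8))
synthetic = compose(verb_from_video_a, prep_from_video_b)
```

In a training loop, a vector like `synthetic` would be labeled with the tail class implied by the chosen verb and preposition pair and added to the classifier's training batch. How ACN actually selects pairs and assigns labels is not specified in this record.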

Funder

Major Science and Technology Innovation 2030 “New Generation Artificial Intelligence” key project

Fundamental Research Funds for the Central Universities of China

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3603253

References: 70 articles.

1. Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, and Alex M. Bronstein. 2019. LaSO: Label-set operations networks for multi-label few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 6548–6557.

2. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 39–48.

3. Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning. 813–824.

4. A systematic study of the class imbalance problem in convolutional neural networks

5. Jonathon Byrd and Zachary Lipton. 2019. What is the effect of importance weighting in deep learning? In Proceedings of the International Conference on Machine Learning. 872–881.