Author:
Yuan Zhaohui, Yang Zhengzhe, Ning Hao, Tang Xiangyang
Abstract
Knowledge distillation is an effective approach for training robust multimodal machine learning models when synchronous multimodal data are unavailable. However, traditional knowledge distillation techniques have limitations in comprehensively transferring knowledge across modalities and models. This paper proposes a multiscale knowledge distillation framework to address these limitations. Specifically, we introduce a multiscale semantic graph mapping (SGM) loss function to enable more comprehensive knowledge transfer between teacher and student networks at multiple feature scales. We also design a fusion and tuning (FT) module to fully exploit correlations within and between different data types of the same modality when training teacher networks. Furthermore, we adopt transformer-based backbones, which improve feature learning compared with traditional convolutional neural networks. We apply the proposed techniques to multimodal human activity recognition; compared with the baseline method, accuracy improves by 2.31% on the MMAct dataset and 0.29% on the UTD-MHAD dataset. Ablation studies validate the necessity of each component.
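The abstract does not give the SGM loss in closed form. The sketch below illustrates one plausible reading under stated assumptions: at each feature scale, the "semantic graph" is taken to be a batch-level cosine-similarity matrix over pooled features, and the distillation loss matches the student's graph to the teacher's across scales. The function names (multiscale_sgm_loss, semantic_graph), the graph construction, and the conv-style (batch, C, H, W) feature shapes are all illustrative assumptions, not the paper's definition.

import torch
import torch.nn.functional as F

def semantic_graph(feats):
    # feats: (batch, dim) pooled features at one scale.
    # Assumed graph: cosine-similarity adjacency over the batch.
    z = F.normalize(feats, dim=1)
    return z @ z.t()  # (batch, batch)

def multiscale_sgm_loss(student_feats, teacher_feats, weights=None):
    # student_feats / teacher_feats: lists of (batch, C_i, H_i, W_i)
    # maps taken from several stages of each backbone; transformer
    # token features (batch, N, C) would be pooled analogously.
    if weights is None:
        weights = [1.0] * len(student_feats)
    loss = 0.0
    for w, s, t in zip(weights, student_feats, teacher_feats):
        # Global-average-pool each scale to (batch, C_i) so the graphs
        # compare sample-to-sample structure rather than raw activations.
        s = s.flatten(2).mean(dim=2)
        t = t.flatten(2).mean(dim=2)
        # Teacher graph is detached: gradients flow only to the student.
        loss = loss + w * F.mse_loss(semantic_graph(s),
                                     semantic_graph(t.detach()))
    return loss

Because the loss compares pairwise relations instead of raw feature values, teacher and student may use different channel widths per scale without a learned projection, which is one reason relation-based matching is a common choice for cross-architecture distillation.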
Funder
Natural Science Foundation of Jiangxi Province
Publisher
Springer Science and Business Media LLC
Cited by
1 article.