LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding
Published: 2023-12-24
Journal: Mathematics
Volume: 12
Issue: 1
Page: 57
ISSN: 2227-7390
Language: en
Affiliation:
1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract
Transformer-based models for action segmentation have achieved high frame-wise accuracy on challenging benchmarks. However, they rely on multiple decoders and self-attention blocks to obtain informative representations, and their heavy computational and memory costs remain an obstacle to handling long video sequences and to practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure built on three key designs. First, we propose receptive field-guided distillation to realize model reduction, which more generally bridges the gap in semantic feature structure between intermediate features via aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention that replaces self-attention and avoids its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, in which temporal graph reasoning introduces the inductive bias that adjacent frames are likely to belong to the same class when modeling global temporal relations, and a cross-model fusion structure integrates frame-level and segment-level temporal cues. This avoids over-segmentation without relying on multiple decoders, further reducing computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. On the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms current state-of-the-art methods in accuracy, edit score, and F1 score.
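To make the first design concrete, below is a minimal sketch of receptive field-guided feature distillation. The abstract does not specify the paper's actual ATDC architecture, so the `ATDCAdapter` module, its dilation rates, and the plain MSE objective are illustrative assumptions: a bank of dilated temporal convolutions widens the student's receptive field before its intermediate features are matched to the teacher's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATDCAdapter(nn.Module):
    """Hypothetical stand-in for aggregated temporal dilation convolution (ATDC):
    several dilated 1-D convolutions are averaged so the student feature map
    approximates the teacher's larger temporal receptive field."""
    def __init__(self, dim, dilations=(1, 2, 4)):  # dilation rates are an assumption
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):  # x: (batch, dim, T); padding keeps length T
        return sum(branch(x) for branch in self.branches) / len(self.branches)

def distill_loss(student_feat, teacher_feat, adapter):
    """MSE between adapted student features and detached teacher features.
    Assumes both share shape (batch, dim, T); the real guidance term may differ."""
    return F.mse_loss(adapter(student_feat), teacher_feat.detach())
```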
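For the second design, the abstract only states that the simplified implicit attention removes the quadratic cost of self-attention. As a point of reference, a well-known way to achieve linear complexity is kernelized linear attention (Katharopoulos et al., 2020), sketched below; this is a named substitute for illustration, not the paper's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(T) attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V),
    so no T x T attention matrix is ever materialized."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, T, dim)
        q = F.elu(self.to_q(x)) + 1  # positive feature map phi
        k = F.elu(self.to_k(x)) + 1
        v = self.to_v(x)
        kv = torch.einsum('btd,bte->bde', k, v)                    # O(T d^2), not O(T^2)
        z = 1.0 / (torch.einsum('btd,bd->bt', q, k.sum(dim=1)) + 1e-6)  # normalizer
        return torch.einsum('btd,bde,bt->bte', q, kv, z)
```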
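For the third design, the inductive bias that adjacent frames share a class can be expressed as graph reasoning over a fixed band adjacency. The sketch below is one plausible reading under that assumption; the module name, neighborhood radius, and single-step propagation are hypothetical, and the paper's action relation encoding and cross-model fusion are likely more elaborate.

```python
import torch
import torch.nn as nn

class TemporalGraphReasoning(nn.Module):
    """One graph-convolution step over frames. A band adjacency links each
    frame to its temporal neighbors, encoding the bias that nearby frames
    tend to belong to the same action class."""
    def __init__(self, dim, radius=1):  # radius is an assumed hyperparameter
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.radius = radius

    def forward(self, x):  # x: (batch, T, dim)
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        adj = ((idx[None, :] - idx[:, None]).abs() <= self.radius).float()  # (T, T) band
        adj = adj / adj.sum(dim=-1, keepdim=True)  # row-normalize neighbor weights
        return torch.relu(self.proj(adj @ x))      # aggregate neighbors, then transform
```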
Funder
Beijing Natural Science Foundation; National Natural Science Foundation of China
Subject
General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)