SKELTER: unsupervised skeleton action denoising and recognition using transformers

Published: 2023-08-23
Volume: 5
ISSN: 2624-9898
Container-title: Frontiers in Computer Science
Short-container-title: Front. Comput. Sci.

Author:

Paoletti Giancarlo, Beyan Cigdem, Del Bue Alessio

Abstract

Unsupervised Human Action Recognition (U-HAR) methods currently leverage large-scale datasets of human poses to solve this challenging problem. As most of the approaches are dedicated to reaching the best recognition accuracies, no attention has been put into analyzing the resilience of such methods given perturbed data, a likely occurrence in real in-the-wild testing scenarios. Our first contribution is to systematically validate the decrease in performance of current U-HAR state-of-the-art using perturbed or altered data (e.g., obtained by removing some skeletal joints, rotating the entire pose, and injecting geometrical aberrations). Then, we propose a novel framework based on a transformer encoder–decoder with remarkable de-noising capabilities to counter such perturbations effectively. Moreover, we also present additional losses to have robust representations against rotation variances and provide temporal motion consistency. Our model, SKELTER, shows limited drops in performance when skeleton noise is present compared with previous approaches, favoring its use in challenging in-the-wild settings.
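The three perturbation types named in the abstract (removing joints, rotating the pose, injecting geometric aberrations) can be illustrated with a minimal sketch. This is not the authors' protocol, which the abstract does not specify. The function names, the zero-fill for missing joints, the rotation about the y axis, the Gaussian noise model, and the 25-joint layout are all assumptions made for illustration.

```python
import math
import random

# A pose sequence is a list of frames; each frame is a list of (x, y, z) joints.

def drop_joints(pose, joint_ids):
    """Replace the selected joints with zeros in every frame (assumed fill value)."""
    missing = set(joint_ids)
    return [
        [(0.0, 0.0, 0.0) if j in missing else joint for j, joint in enumerate(frame)]
        for frame in pose
    ]

def rotate_pose(pose, angle_rad):
    """Rotate every joint about the y axis (assumed to be the vertical axis)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [(c * x + s * z, y, -s * x + c * z) for (x, y, z) in frame]
        for frame in pose
    ]

def jitter_pose(pose, sigma, rng):
    """Add zero-mean Gaussian noise to each coordinate (assumed aberration model)."""
    return [
        [tuple(v + rng.gauss(0.0, sigma) for v in joint) for joint in frame]
        for frame in pose
    ]

# Hypothetical input: 4 frames of a 25-joint skeleton with random coordinates.
rng = random.Random(0)
pose = [[(rng.random(), rng.random(), rng.random()) for _ in range(25)] for _ in range(4)]

perturbed = jitter_pose(rotate_pose(drop_joints(pose, [0, 3]), math.pi / 6), 0.01, rng)
```

Applying the same functions at increasing severity (more dropped joints, larger angles, larger sigma) is one way to probe how a recognition model degrades under noisy input.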

Publisher

Frontiers Media SA

Subject

Computer Science Applications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Computer Science (miscellaneous)

Reference: 73 articles.

1. Learning local feature descriptors with triplets and shallow convolutional neural networks; Balntas; BMVC, 2016

2. Coding Kendall's shape trajectories for 3D action recognition; Ben Tanfous; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018

3. Is space-time attention all you need for video understanding?; Bertasius; ICML, 2021

4. Modeling multiple temporal scales of full-body movements for emotion classification; Beyan; IEEE Trans. Affect. Comput., 2021

5. Language models are few-shot learners; Brown; Advances in Neural Information Processing Systems 33, 2020