Affiliation:
1. Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
2. BTG Pactual, São Paulo, Brazil
3. Military Institute of Engineering (IME), Rio de Janeiro, Brazil
Abstract
Action segmentation consists of temporally partitioning a video and labeling each resulting interval with the action it depicts. In this work, we propose a novel action segmentation method that requires no initial video analysis and no annotated data. Our approach extracts features from videos using several pre-trained deep-learning models, including spatiotemporal and self-supervised ones. The features are then transformed with a positional encoder, and finally a clustering algorithm is applied, where each resulting cluster is expected to correspond to a single, distinguishable action. For self-supervised features we explored DINO, and for spatiotemporal features we investigated I3D and SlowFast. Moreover, two clustering algorithms (FINCH and KMeans) were investigated, and we also studied how varying the length of the video snippets that generate the feature vectors affects segmentation quality. Experiments show that our method produces competitive results on the Breakfast and INRIA Instructional Videos dataset benchmarks. Our best result was produced using a composition of self-supervised features generated by DINO, FINCH clustering, and positional encoding.
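
To make the encoding-and-clustering stage concrete, the following is a minimal Python sketch assuming per-snippet feature vectors have already been extracted (e.g., by DINO). The additive sinusoidal positional encoding, the function names, the feature dimension, and the cluster count are all illustrative assumptions; the abstract does not specify these details, and KMeans (via scikit-learn) is shown here simply because it is one of the two clustering algorithms the paper investigates.

import numpy as np
from sklearn.cluster import KMeans

def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    # Standard sinusoidal encoding (Vaswani et al., 2017); dim assumed even.
    positions = np.arange(num_positions)[:, None]                      # (T, 1)
    div_terms = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe

def segment_video(features: np.ndarray, num_actions: int) -> np.ndarray:
    # features: (T, D) array, one row per temporally ordered video snippet.
    # Adding the positional encoding injects temporal order before clustering
    # (an assumed combination; the paper only says the data is "transformed
    # using a positional encoder").
    encoded = features + sinusoidal_positional_encoding(*features.shape)
    # Contiguous runs of the same cluster label form the predicted segments.
    return KMeans(n_clusters=num_actions, n_init=10).fit_predict(encoded)

# Usage with random stand-in features: 120 snippets, 768-dim (DINO ViT-B size).
rng = np.random.default_rng(0)
labels = segment_video(rng.standard_normal((120, 768)), num_actions=5)
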
Funder
Air Force Office of Scientific Research
Publisher
Association for Computing Machinery (ACM)