Affiliation:
1. Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
2. BTG Pactual, São Paulo, Brazil
3. Military Institute of Engineering (IME), Rio de Janeiro, Brazil
Abstract
Action segmentation consists of temporally partitioning a video and labeling each resulting interval with the action it depicts. In this work, we propose a novel action segmentation method that requires no initial video analysis and no annotated data. Our approach extracts features from videos using several pre-trained deep-learning models, including spatiotemporal and self-supervised ones. The features are then transformed with a positional encoder, and finally a clustering algorithm is applied, where each resulting cluster is expected to correspond to a single, distinguishable action. For self-supervised features we explored DINO, and for spatiotemporal features we investigated I3D and SlowFast. Moreover, two clustering algorithms (FINCH and KMeans) were investigated, and we also studied how varying the length of the video snippets that generate the feature vectors affects segmentation quality. Experiments show that our method produces competitive results on the Breakfast and INRIA Instructional Videos dataset benchmarks. Our best result was produced using a composition of self-supervised features generated by DINO, FINCH clustering, and positional encoding.
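
To make the encoding-and-clustering stage concrete, the following is a minimal Python sketch assuming per-snippet feature vectors have already been extracted (e.g., by DINO). The additive sinusoidal positional encoding, the function names, the feature dimension, and the cluster count are all illustrative assumptions; the abstract does not specify these details, and KMeans (via scikit-learn) is shown here simply because it is one of the two clustering algorithms the paper investigates.

import numpy as np
from sklearn.cluster import KMeans

def sinusoidal_positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    # Standard sinusoidal encoding (Vaswani et al., 2017); dim assumed even.
    positions = np.arange(num_positions)[:, None]                      # (T, 1)
    div_terms = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe

def segment_video(features: np.ndarray, num_actions: int) -> np.ndarray:
    # features: (T, D) array, one row per temporally ordered video snippet.
    # Adding the positional encoding injects temporal order before clustering
    # (an assumed combination; the paper only says the data is "transformed
    # using a positional encoder").
    encoded = features + sinusoidal_positional_encoding(*features.shape)
    # Contiguous runs of the same cluster label form the predicted segments.
    return KMeans(n_clusters=num_actions, n_init=10).fit_predict(encoded)

# Usage with random stand-in features: 120 snippets, 768-dim (DINO ViT-B size).
rng = np.random.default_rng(0)
labels = segment_video(rng.standard_normal((120, 768)), num_actions=5)
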
Funder
Air Force Office of Scientific Research
Publisher
Association for Computing Machinery (ACM)