Abstract
In a single-view setting, our approach uses a unified Transformer framework with an enhanced multi-scale sparse attention mechanism to jointly perform three tasks: multi-pedestrian 3D pose estimation, tracking, and prediction. A video transformer first encodes spatio-temporal features from multiple input frames; a decoder then extracts salient pose features via multi-person pose queries, which are regressed in a single shot to recover multi-person pose trajectories and predict future movements. To mitigate occlusion and the complexity of pedestrian motion, a backbone network extracts detailed features from the video, while an improved multi-scale spatio-temporal attention mechanism aggregates information across frames at multiple scales and captures long-term interactions. Because the attention mechanism is compact in its parameters, it balances efficiency and accuracy; the integration of these components therefore improves prediction accuracy without an excessive increase in model size.
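To illustrate the idea of aggregating spatio-temporal information at multiple scales, the sketch below pools frame-level features at several temporal strides and attends from full-resolution queries to each coarsened key/value set. This is a minimal, hypothetical NumPy sketch: the function names, mean-pooling scheme, and scale factors are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Tq, D) x (Tk, D) -> (Tq, D).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def multi_scale_attention(tokens, scales=(1, 2, 4)):
    """tokens: (T, D) per-frame features.
    For each scale s, mean-pool the sequence by factor s to form a
    coarser key/value set, attend from the full-resolution tokens,
    then average the per-scale outputs (assumed aggregation)."""
    T, D = tokens.shape
    outputs = []
    for s in scales:
        n = T // s
        pooled = tokens[: n * s].reshape(n, s, D).mean(axis=1)
        outputs.append(attention(tokens, pooled, pooled))
    return np.mean(outputs, axis=0)

feats = np.random.default_rng(0).normal(size=(8, 16))
out = multi_scale_attention(feats)
print(out.shape)  # (8, 16)
```

Coarser scales give each query a cheap summary of long temporal context, while the finest scale preserves per-frame detail, which matches the abstract's efficiency-versus-accuracy trade-off.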