Abstract
Human pose estimation is attracting rapidly growing research attention as a core problem in computer vision. To address the shortcomings of existing Transformer-based pose estimation methods in handling local features, this work presents MAQT, an enhanced end-to-end method for precise multi-person pose estimation. To improve the localization of keypoints that are sensitive to scale changes, MAQT introduces an Asym-Fusion block. Additionally, we design a new query strategy, Uncertainty-minimal Query Selection, to optimize the initial selection of queries. In the decoding phase, MAQT combines two self-attention mechanisms to more accurately capture the intricate relationships among keypoints. Experimental results on the MS COCO and CrowdPose datasets show that MAQT outperforms contemporary methods.