SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation (WholeBody - 133 Keypoints)
Model Description
SDPose is a state-of-the-art human pose estimation model that leverages the visual priors of Stable Diffusion to stay robust in out-of-distribution (OOD) scenarios. This model variant estimates 133 wholebody keypoints covering the body, hands, face, and feet.
Model Architecture
SDPose employs a U-Net backbone initialized with Stable Diffusion v2 weights, combined with a specialized heatmap head for keypoint prediction. The model operates in a top-down manner:
- Person Detection: Detect human bounding boxes using an object detector (e.g., YOLO11-x)
- Pose Estimation: Crop each detected person and estimate 133 wholebody keypoints per crop
- Heatmap Generation: Produce confidence heatmaps for precise keypoint estimation
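The crop step of the top-down pipeline can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `expand_box_to_aspect` and the `padding` factor of 1.25 are assumptions made for the example. It grows a detector box about its center until it matches the 768:1024 (W:H) aspect ratio of the model input, so the crop can be resized without distortion.

```python
def expand_box_to_aspect(box, target_h=1024, target_w=768, padding=1.25):
    """Grow an (x1, y1, x2, y2) box about its center to the target aspect ratio.

    `padding` is an assumed margin factor for this sketch, not a value taken
    from the SDPose configuration.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = float(x2 - x1), float(y2 - y1)
    aspect = target_w / target_h  # width / height of the model input

    # Enlarge the shorter side so the box matches the input aspect ratio.
    if w > h * aspect:
        h = w / aspect
    else:
        w = h * aspect

    # Add a margin so limbs near the box border are not cut off.
    w *= padding
    h *= padding
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


if __name__ == "__main__":
    # A square detection becomes a portrait crop with a 3:4 width-to-height ratio.
    print(expand_box_to_aspect((100, 100, 300, 300)))
```

The returned box may extend past the image border; a real pipeline would pad or clamp before resizing the crop to 1024×768.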
Model Specifications:
- Backbone: Stable Diffusion v2 U-Net (fine-tuned; minimal architectural changes)
- Head: Custom heatmap prediction head
- Input Resolution: 1024×768 (H×W)
- Output: 133 keypoint heatmaps + coordinates with confidence scores
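To make the output format concrete, here is a rough sketch of how per-keypoint heatmaps can be turned into coordinates and confidence scores. It uses a plain argmax, which is a simplification and not the UDP decoding that SDPose uses through MMPose. The function name `decode_heatmaps` and the 256×192 heatmap size in the usage lines are assumptions for illustration only.

```python
import numpy as np


def decode_heatmaps(heatmaps, input_size=(1024, 768)):
    """Convert (K, h, w) heatmaps to (K, 2) x/y coordinates and (K,) scores.

    Plain argmax decoding: the peak location gives the keypoint position and
    the peak value is used as the confidence score.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    peak_idx = flat.argmax(axis=1)
    scores = flat[np.arange(k), peak_idx]

    # Recover row (y) and column (x) of each peak in heatmap space.
    ys, xs = np.unravel_index(peak_idx, (h, w))

    # Rescale from heatmap resolution to input-crop pixels.
    in_h, in_w = input_size
    coords = np.stack([xs * (in_w / w), ys * (in_h / h)], axis=1)
    return coords.astype(np.float32), scores


if __name__ == "__main__":
    # Hypothetical quarter-resolution heatmaps for 133 keypoints.
    hm = np.zeros((133, 256, 192), dtype=np.float32)
    hm[0, 10, 20] = 0.9
    coords, scores = decode_heatmaps(hm)
    print(coords[0], scores[0])
```

Coordinates produced this way are relative to the person crop; mapping them back to the full image requires undoing the crop and resize applied before inference.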
- Framework: MMPose
Supported Keypoints (COCO Wholebody Format)
The model predicts 133 keypoints following the COCO-WholeBody format: 17 body, 6 foot, 68 face, and 42 hand (21 per hand) keypoints.
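For downstream use it is often convenient to split the 133 keypoints by body part. The sketch below assumes the standard COCO-WholeBody ordering (body, feet, face, left hand, right hand); the names `WHOLEBODY_GROUPS` and `split_keypoints` are made up for this example and are not part of the SDPose code.

```python
# Index ranges in the standard COCO-WholeBody ordering (133 keypoints total).
WHOLEBODY_GROUPS = {
    "body": range(0, 17),          # 17 COCO body keypoints
    "feet": range(17, 23),         # 6 foot keypoints
    "face": range(23, 91),         # 68 face landmarks
    "left_hand": range(91, 112),   # 21 left-hand keypoints
    "right_hand": range(112, 133), # 21 right-hand keypoints
}


def split_keypoints(keypoints):
    """Split a length-133 sequence of keypoints into named part groups."""
    if len(keypoints) != 133:
        raise ValueError(f"expected 133 keypoints, got {len(keypoints)}")
    return {
        name: [keypoints[i] for i in indices]
        for name, indices in WHOLEBODY_GROUPS.items()
    }


if __name__ == "__main__":
    dummy = [(float(i), float(i)) for i in range(133)]
    parts = split_keypoints(dummy)
    print({name: len(points) for name, points in parts.items()})
```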
Intended Use
- Human pose estimation in natural images
- Pose estimation in artistic and stylized domains (paintings, anime, sketches)
- Animation and video pose tracking
- Cross-domain pose analysis and research
- Applications requiring robust pose estimation under distribution shifts
How to Use
Installation
# Clone the repository
git clone https://github.com/t-s-liang/SDPose-OOD.git
cd SDPose-OOD
# Install dependencies
pip install -r requirements.txt
# Download YOLO11-x for human detection
wget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11x.pt -P models/
# Launch Gradio interface
cd gradio_app
bash launch_gradio.sh
Training Data
Datasets
Trained exclusively on COCO-2017 train2017 (no extra data).
- COCO-Wholebody (Common Objects in Context): 200K+ images with 133 wholebody keypoints
Preprocessing
- Images are resized and cropped to 1024×768 resolution
- Augmentation: random horizontal flip, half-body & bbox transforms, UDP affine; Albumentations (Gaussian/Median blur, coarse dropout).
- Heatmaps: UDP codec (MMPose style).
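As an illustration of why the random horizontal flip needs keypoint-aware handling, the sketch below mirrors x-coordinates and swaps left/right labels. It covers only the 17 COCO body keypoints; the full 133-keypoint format needs additional pairs for feet, face, and hands, which are omitted here. The function name `hflip_keypoints` is an assumption for this example, and the actual training pipeline relies on MMPose transforms rather than this code.

```python
# Left/right index pairs for the 17 COCO body keypoints
# (eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_BODY_FLIP_PAIRS = [
    (1, 2), (3, 4), (5, 6), (7, 8),
    (9, 10), (11, 12), (13, 14), (15, 16),
]


def hflip_keypoints(keypoints, image_width, flip_pairs=COCO_BODY_FLIP_PAIRS):
    """Mirror (x, y) keypoints horizontally and swap left/right labels."""
    # Mirror x about the vertical center line of the image.
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    # After mirroring, the person's left side appears on the right, so the
    # semantic labels must be swapped as well.
    for left, right in flip_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


if __name__ == "__main__":
    body = [(float(i), 0.0) for i in range(17)]
    print(hflip_keypoints(body, image_width=768)[:3])
```

Skipping the label swap would train the model to call a mirrored left wrist a left wrist even though it now sits on the subject's right side.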
Comparison with Baselines
SDPose significantly outperforms traditional pose estimation models (e.g., Sapiens) on out-of-distribution benchmarks while maintaining competitive performance on in-domain data.
See our paper for comprehensive evaluation results.
Citation
If you use SDPose in your research, please cite our paper:
@misc{liang2025sdposeexploitingdiffusionpriors,
title={SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation},
author={Shuang Liang and Jing He and Chuanmeizhi Wang and Lejun Liao and Guo Zhang and Yingcong Chen and Yuan Yuan},
year={2025},
eprint={2509.24980},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.24980},
}
License
This model is released under the MIT License.
Additional Resources
- Project Website: https://t-s-liang.github.io/SDPose
- Paper: arXiv:2509.24980 (https://arxiv.org/abs/2509.24980)
- Code Repository: GitHub (https://github.com/t-s-liang/SDPose-OOD)
- Demo: HuggingFace Space
- Contact: [email protected]
Star us on GitHub; it motivates us a lot!