SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation (WholeBody - 133 Keypoints)


Model Description

SDPose is a human pose estimation model that uses the visual priors of Stable Diffusion to stay robust in out-of-distribution (OOD) scenarios. This model variant estimates 133 wholebody keypoints covering the body, hands, face, and feet.

Model Architecture

SDPose employs a U-Net backbone initialized with Stable Diffusion v2 weights, combined with a specialized heatmap head for keypoint prediction. The model operates in a top-down manner:

  1. Person Detection: Detect human bounding boxes using an object detector (e.g., YOLO11-x)
  2. Pose Estimation: Crop each detected person and estimate 133 wholebody keypoints
  3. Heatmap Generation: Produce confidence heatmaps for precise keypoint estimation
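The cropping step of a top-down pipeline can be sketched as follows. This is a minimal illustration, not the SDPose implementation: the function names `bbox_to_crop` and `crop_to_image` and the `padding` factor of 1.25 are assumptions made for the example. It expands a detector box to the 768:1024 (W:H) aspect ratio of the model input and maps a predicted point back to original-image pixels.

```python
def bbox_to_crop(bbox, input_hw=(1024, 768), padding=1.25):
    """Expand a detector box (x1, y1, x2, y2) to the model's aspect ratio.

    Returns (cx, cy, crop_w, crop_h) in image pixels. `padding` is an
    assumed margin factor, not a value taken from the SDPose code.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    aspect = input_hw[1] / input_hw[0]  # width / height = 0.75
    # Grow the shorter side so the crop matches the input aspect ratio.
    if w > aspect * h:
        h = w / aspect
    else:
        w = aspect * h
    return cx, cy, w * padding, h * padding


def crop_to_image(x, y, crop, input_hw=(1024, 768)):
    """Map a point from model-input pixels back to original-image pixels."""
    cx, cy, cw, ch = crop
    in_h, in_w = input_hw
    return (cx - cw / 2.0 + x * cw / in_w,
            cy - ch / 2.0 + y * ch / in_h)


# A 200x400 person box becomes a 375x500 crop centred on the same point.
crop = bbox_to_crop((100, 50, 300, 450))
center_in_image = crop_to_image(384, 512, crop)
```

The centre of the model input (384, 512) maps back to the centre of the detected box, which is a quick sanity check for the transform.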

Model Specifications:

  • Backbone: Stable Diffusion v2 U-Net (fine-tuned; minimal architectural changes)
  • Head: Custom heatmap prediction head
  • Input Resolution: 1024×768 (H×W)
  • Output: 133 keypoint heatmaps + coordinates with confidence scores
  • Framework: MMPose
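To illustrate how keypoint heatmaps turn into coordinates with confidence scores, here is a minimal sketch using plain argmax decoding. It is an assumption-laden simplification: the function name `decode_heatmaps` and the 256×192 heatmap size are hypothetical, and the sub-pixel refinement that a UDP-style codec adds is omitted.

```python
import numpy as np


def decode_heatmaps(heatmaps):
    """Return (K, 2) peak locations as (x, y) and (K,) peak scores.

    `heatmaps` has shape (K, H, W). Plain argmax decoding only, with no
    sub-pixel refinement.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)
    # The peak value of each heatmap serves as the confidence score.
    scores = flat[np.arange(k), idx]
    ys, xs = np.unravel_index(idx, (h, w))
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    return coords, scores


# Toy input: 133 empty heatmaps with a single peak in channel 0.
# The 256x192 heatmap size is illustrative only.
heatmaps = np.zeros((133, 256, 192), dtype=np.float32)
heatmaps[0, 40, 30] = 0.9
coords, scores = decode_heatmaps(heatmaps)
```

Coordinates come out in heatmap pixels as (x, y); they would still need to be scaled to the input resolution and mapped back to the original image.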

Supported Keypoints (COCO Wholebody Format)

The model predicts 133 wholebody keypoints (body, feet, face, and hands) following the COCO Wholebody keypoint format.
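As a rough guide to the layout, the sketch below splits a 133-keypoint prediction into named groups. It assumes the standard COCO-WholeBody ordering (17 body, 6 foot, 68 face, 21 left-hand, and 21 right-hand keypoints); the names `WHOLEBODY_GROUPS` and `split_keypoints` are hypothetical helpers, not part of the SDPose code.

```python
# Index ranges in the COCO-WholeBody ordering (end index is exclusive).
# Assumes the standard ordering: body, feet, face, left hand, right hand.
WHOLEBODY_GROUPS = {
    "body": (0, 17),
    "feet": (17, 23),
    "face": (23, 91),
    "left_hand": (91, 112),
    "right_hand": (112, 133),
}


def split_keypoints(keypoints):
    """Split a sequence of 133 keypoints into named groups."""
    if len(keypoints) != 133:
        raise ValueError(f"expected 133 keypoints, got {len(keypoints)}")
    return {name: keypoints[a:b] for name, (a, b) in WHOLEBODY_GROUPS.items()}


# Use keypoint indices as stand-ins for real (x, y, score) entries.
groups = split_keypoints(list(range(133)))
```

Slicing by group makes it easy to visualize or evaluate one part at a time, for example only the hands.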

Intended Use

  • Human pose estimation in natural images
  • Pose estimation in artistic and stylized domains (paintings, anime, sketches)
  • Animation and video pose tracking
  • Cross-domain pose analysis and research
  • Applications requiring robust pose estimation under distribution shifts

How to Use

Installation

# Clone the repository
git clone https://github.com/t-s-liang/SDPose-OOD.git
cd SDPose-OOD

# Install dependencies
pip install -r requirements.txt

# Download YOLO11-x for human detection
wget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11x.pt -P models/

# Launch Gradio interface
cd gradio_app
bash launch_gradio.sh

Training Data

Datasets

Trained exclusively on COCO-2017 train2017 (no extra data).

  • COCO-Wholebody (Common Objects in Context): 200K+ images with 133 wholebody keypoints

Preprocessing

  • Images are resized and cropped to 1024×768 resolution
  • Augmentation: random horizontal flip, half-body & bbox transforms, UDP affine; Albumentations (Gaussian/Median blur, coarse dropout).
  • Heatmaps: UDP codec (MMPose style).

Comparison with Baselines

SDPose significantly outperforms traditional pose estimation models (e.g., Sapiens) on out-of-distribution benchmarks while maintaining competitive performance on in-domain data.

See our paper for comprehensive evaluation results.

Citation

If you use SDPose in your research, please cite our paper:

@misc{liang2025sdposeexploitingdiffusionpriors,
      title={SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation}, 
      author={Shuang Liang and Jing He and Chuanmeizhi Wang and Lejun Liao and Guo Zhang and Yingcong Chen and Yuan Yuan},
      year={2025},
      eprint={2509.24980},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24980}, 
}

License

This model is released under the MIT License.

Additional Resources


⭐ Star us on GitHub: it motivates us a lot!
