SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned Pi0.5 (π₀.₅) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

Task Description

Goal: Pick up an orange ball from the table and place it into a pink cup.

Robot: SO-101 - 6-DOF robot arm with gripper

Cameras: Dual camera setup (overhead + wrist-mounted)

Model Architecture

Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

| Component | Description |
|---|---|
| Vision Encoder | SigLIP (400M) - processes camera images |
| Language Model | Gemma (2B) - scene understanding & task grounding |
| Action Expert | Flow matching head - generates smooth action trajectories |
| Total Parameters | ~3B |

The model maps natural language instructions + camera images → continuous joint actions.
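The input/output contract can be sketched as a dictionary of camera images, proprioceptive state, and a text prompt mapping to a chunk of joint actions. The key names, image sizes, and action horizon below are illustrative assumptions, not the checkpoint's actual schema:

```python
import numpy as np

# Hypothetical observation layout for this SO-101 setup (key names and
# shapes are assumptions for illustration, not the real model schema).
observation = {
    "images": {
        "overhead": np.zeros((224, 224, 3), dtype=np.uint8),  # top-down camera
        "wrist": np.zeros((224, 224, 3), dtype=np.uint8),     # wrist-mounted camera
    },
    "state": np.zeros(6, dtype=np.float32),  # 6-DOF joint positions
    "prompt": "pick up the orange ball and place it in the pink cup",
}

def fake_policy(obs):
    """Stand-in for the Pi0.5 server: returns a chunk of continuous
    joint actions (horizon x 6 DOF), as an action expert head would."""
    horizon = 10  # assumed action-chunk length
    return np.zeros((horizon, obs["state"].shape[0]), dtype=np.float32)

actions = fake_policy(observation)
print(actions.shape)  # (10, 6)
```

In the real system this call happens on the remote GPU server; the client only assembles the observation and executes the returned action chunk.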

Training Details

| Parameter | Value |
|---|---|
| Base Model | Pi0.5 (Physical Intelligence) |
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Fine-tuning Steps | 5,000 |
| Hardware | A100 80GB on RunPod |
| Training Time | ~3-4 hours |
| Cost | ~$6-8 USD |
| Framework | OpenPi (JAX/Flax) |

Inference Performance

JPEG Compression Optimization

We implemented JPEG compression to reduce network transfer time for remote inference:

Per-inference round-trip latency:

| Location | Raw Images | JPEG (Q80) | Speedup |
|---|---|---|---|
| EU Spot | 1448 ms | 375 ms | 3.9x |
| US On-Demand | 600 ms | 270 ms | 2.2x |

| Metric | Before | After |
|---|---|---|
| Payload Size | 1.8 MB | 71 KB |
| Control Rate (US) | 1.7 Hz | 3.7 Hz |
| Compression Ratio | - | 25x |
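The control rates in the table follow directly from the latencies: when the loop is dominated by inference plus transfer, the achievable rate is roughly the reciprocal of the per-step latency. A quick check against the US numbers above:

```python
# Control rate is roughly the reciprocal of per-inference latency,
# assuming the control loop is dominated by inference + network transfer.
def control_rate_hz(latency_ms: float) -> float:
    return 1000.0 / latency_ms

raw_hz = control_rate_hz(600)   # US on-demand, raw images
jpeg_hz = control_rate_hz(270)  # US on-demand, JPEG Q80

print(round(raw_hz, 1), round(jpeg_hz, 1))  # 1.7 3.7
```

This reproduces the 1.7 Hz → 3.7 Hz improvement reported above, and 600/270 ≈ 2.2x matches the speedup column.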

Architecture

```
[RunPod GPU Server]            [Robot Mac]
┌─────────────────┐            ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ───►│ run_pi05.py  │
│ (RTX 4090)      │    JPEG    │ (Robot ctrl) │
└─────────────────┘            └──────────────┘
```
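The wire format between client and server can be pictured as a small request envelope. The real system serializes with msgpack over WebSocket (see Infrastructure Notes); this sketch uses stdlib json purely to stay dependency-free, and the field names are assumptions:

```python
import base64
import json

# Illustrative request envelope for the client <-> server link.
# The real transport is WebSocket + msgpack; json + base64 is used here
# only as a dependency-free stand-in. Field names are assumptions.
jpeg_bytes = b"\xff\xd8\xff\xe0fake-jpeg-payload"  # placeholder JPEG data

request = {
    "prompt": "pick up the orange ball and place it in the pink cup",
    "state": [0.0] * 6,  # 6-DOF joint positions
    "images": {
        # binary image data must be base64-encoded to survive json
        "overhead": base64.b64encode(jpeg_bytes).decode("ascii"),
        "wrist": base64.b64encode(jpeg_bytes).decode("ascii"),
    },
}
wire = json.dumps(request).encode("utf-8")

# Server side: decode the envelope and recover the image bytes.
decoded = json.loads(wire)
restored = base64.b64decode(decoded["images"]["overhead"])
print(restored == jpeg_bytes)  # True
```

msgpack avoids the base64 step entirely by carrying raw bytes natively, which is one reason it is the better fit for this link.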

Demo

With JPEG Compression (~270ms latency)

Evaluation Demo - JPEG Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control

Without JPEG Compression (~600ms latency)

Evaluation Demo - Raw Side-by-side: Same task but with raw image transfer - 1.7 Hz control

Sample Evaluation

JPEG Compression (Fast)

Evaluation Composite - JPEG 5-frame composite: Start β†’ Approach β†’ Grasp β†’ Transport β†’ Final

Raw Images (Slow)

Evaluation Composite - Raw Same task without JPEG optimization

Usage

Server Setup (RunPod)

```bash
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
    --include "4999/**" \
    --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
    policy:checkpoint \
    --policy.config=pi05_so101 \
    --policy.dir=checkpoints/pi05_so101/4999
```

Client (Robot Mac)

```bash
# Install the client library
pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
```
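The two invocations above differ only in the `--no-jpeg` flag. A minimal sketch of how that flag handling might look inside `run_pi05.py` (the parser details are assumptions; only the flag names come from the commands above):

```python
import argparse

# Sketch of the command-line interface run_pi05.py appears to expose;
# the parser internals and help strings here are assumptions.
parser = argparse.ArgumentParser(description="SO-101 Pi0.5 inference client")
parser.add_argument("--server", required=True,
                    help="WebSocket policy server, e.g. wss://POD.proxy.runpod.net")
parser.add_argument("--no-jpeg", action="store_true",
                    help="send raw images instead of JPEG Q80 (slower)")

# Parse an example command line (normally argv would be used).
args = parser.parse_args(["--server", "wss://example-pod.proxy.runpod.net"])
use_jpeg = not args.no_jpeg
print(use_jpeg)  # True
```

With `use_jpeg` set, each camera frame would be JPEG-encoded before being attached to the request; with `--no-jpeg`, raw arrays go over the wire.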

External Videos (Phone Capture)

Real-world demonstrations recorded externally during evaluation runs:

JPEG Compression (~270ms)

External Video - JPEG External phone recording showing smooth robot control with JPEG compression

Raw Images (~600ms)

External Video - Raw Same task without compression - noticeably slower/choppier control

Edge Retrieval (Out-of-Distribution)

External Video - Edge Ball placed at workspace edge - a position that appeared in <10% of training episodes

Comparison with ACT Policy

Trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|---|---|---|---|---|
| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M params) | Local | ✅ | ❌ Center only |

Edge Retrieval: Pi0.5 vs ACT

ACT failed at edge positions. Its policy saw only the 72 training episodes, in which the ball sat mostly in the center of the reachable workspace; when the ball was placed at the workspace edge, ACT missed it or failed to reach it entirely.

Pi0.5 succeeds at edge positions despite having the same training data. This demonstrates the power of VLA pre-training:

  1. SigLIP (vision encoder) was pre-trained on billions of images - understands "ball" and "edge" concepts generally
  2. Gemma (language model) provides semantic grounding - "pick up ball" applies regardless of position
  3. Action Expert learned smooth motion primitives from diverse robot arms during base model training

The base Pi0.5 model was trained on data from many different robot arms performing various tasks. This gives it a strong prior on reachable workspace and arm kinematics that ACT (trained from scratch) simply doesn't have.

Infrastructure Notes

Remote Inference Setup:

  • Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
  • Client: Mac Mini M4 controlling SO-101 robot
  • Protocol: WebSocket with msgpack serialization
  • Optimization: JPEG compression reduces 1.8 MB → 71 KB per inference
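The per-frame compression step is straightforward: each camera frame is encoded to JPEG at quality 80 before serialization. A Pillow-based sketch (the actual implementation in the OpenPi fork may use a different image library):

```python
import io

import numpy as np
from PIL import Image

def encode_jpeg(frame: np.ndarray, quality: int = 80) -> bytes:
    """Compress an HxWx3 uint8 camera frame to JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# A synthetic 480x640 RGB frame stands in for a real camera image;
# a mostly uniform image compresses far better than a natural scene would.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[:, :, 0] = 200

raw_size = frame.nbytes             # 921,600 bytes for one raw camera frame
jpeg_size = len(encode_jpeg(frame))
print(jpeg_size < raw_size)  # True
```

On real camera frames the ratio lands around the 25x reported above; synthetic uniform images compress even more aggressively, so this sketch only demonstrates the mechanism, not the exact ratio.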

Known Issues:

  • RTX 4090 is borderline for memory - occasional OOM during model loading
  • US datacenters preferred (2x faster than EU for network transfer)
  • First inference takes 30-60s (JAX JIT compilation)

Limitations

  • Requires GPU server for inference (not yet optimized for edge deployment)
  • Sensitive to lighting changes
  • 72 training episodes may limit extreme edge case handling

Citation

```bibtex
@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
```

Acknowledgments
