SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned Pi0.5 (π₀.₅) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

Task Description

Goal: Pick up an orange ball from the table and place it into a pink cup.

Robot: SO-101 - 6-DOF robot arm with gripper

Cameras: Dual camera setup (overhead + wrist-mounted)

Model Architecture

Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

| Component | Description |
|---|---|
| Vision Encoder | SigLIP (400M) - processes camera images |
| Language Model | Gemma (2B) - scene understanding & task grounding |
| Action Expert | Flow matching head - generates smooth action trajectories |
| Total Parameters | ~3B |

The model maps natural language instructions + camera images → continuous joint actions.
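The input/output contract can be sketched as a dictionary of camera images, proprioceptive state, and a text prompt mapping to a chunk of joint actions. The key names, image sizes, and action horizon below are illustrative assumptions, not the checkpoint's actual schema:

```python
import numpy as np

# Hypothetical observation layout for this SO-101 setup (key names and
# shapes are assumptions for illustration, not the real model schema).
observation = {
    "images": {
        "overhead": np.zeros((224, 224, 3), dtype=np.uint8),  # top-down camera
        "wrist": np.zeros((224, 224, 3), dtype=np.uint8),     # wrist-mounted camera
    },
    "state": np.zeros(6, dtype=np.float32),  # 6-DOF joint positions
    "prompt": "pick up the orange ball and place it in the pink cup",
}

def fake_policy(obs):
    """Stand-in for the Pi0.5 server: returns a chunk of continuous
    joint actions (horizon x 6 DOF), as an action expert head would."""
    horizon = 10  # assumed action-chunk length
    return np.zeros((horizon, obs["state"].shape[0]), dtype=np.float32)

actions = fake_policy(observation)
print(actions.shape)  # (10, 6)
```

In the real system this call happens on the remote GPU server; the client only assembles the observation and executes the returned action chunk.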

Training Details

| Parameter | Value |
|---|---|
| Base Model | Pi0.5 (Physical Intelligence) |
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Fine-tuning Steps | 5,000 |
| Hardware | A100 80GB on RunPod |
| Training Time | ~3-4 hours |
| Cost | ~$6-8 USD |
| Framework | OpenPi (JAX/Flax) |

Inference Performance

JPEG Compression Optimization

We implemented JPEG compression to reduce network transfer time for remote inference:

Per-inference round-trip latency:

| Location | Raw Images | JPEG (Q80) | Speedup |
|---|---|---|---|
| EU Spot | 1448 ms | 375 ms | 3.9x |
| US On-Demand | 600 ms | 270 ms | 2.2x |

| Metric | Before | After |
|---|---|---|
| Payload Size | 1.8 MB | 71 KB |
| Control Rate (US) | 1.7 Hz | 3.7 Hz |
| Compression Ratio | - | 25x |
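The control rates in the table follow directly from the latencies: when the loop is dominated by inference plus transfer, the achievable rate is roughly the reciprocal of the per-step latency. A quick check against the US numbers above:

```python
# Control rate is roughly the reciprocal of per-inference latency,
# assuming the control loop is dominated by inference + network transfer.
def control_rate_hz(latency_ms: float) -> float:
    return 1000.0 / latency_ms

raw_hz = control_rate_hz(600)   # US on-demand, raw images
jpeg_hz = control_rate_hz(270)  # US on-demand, JPEG Q80

print(round(raw_hz, 1), round(jpeg_hz, 1))  # 1.7 3.7
```

This reproduces the 1.7 Hz → 3.7 Hz improvement reported above, and 600/270 ≈ 2.2x matches the speedup column.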

Architecture

```
[RunPod GPU Server]            [Robot Mac]
┌─────────────────┐            ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ───►│ run_pi05.py  │
│ (RTX 4090)      │    JPEG    │ (Robot ctrl) │
└─────────────────┘            └──────────────┘
```
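The wire format between client and server can be pictured as a small request envelope. The real system serializes with msgpack over WebSocket (see Infrastructure Notes); this sketch uses stdlib json purely to stay dependency-free, and the field names are assumptions:

```python
import base64
import json

# Illustrative request envelope for the client <-> server link.
# The real transport is WebSocket + msgpack; json + base64 is used here
# only as a dependency-free stand-in. Field names are assumptions.
jpeg_bytes = b"\xff\xd8\xff\xe0fake-jpeg-payload"  # placeholder JPEG data

request = {
    "prompt": "pick up the orange ball and place it in the pink cup",
    "state": [0.0] * 6,  # 6-DOF joint positions
    "images": {
        # binary image data must be base64-encoded to survive json
        "overhead": base64.b64encode(jpeg_bytes).decode("ascii"),
        "wrist": base64.b64encode(jpeg_bytes).decode("ascii"),
    },
}
wire = json.dumps(request).encode("utf-8")

# Server side: decode the envelope and recover the image bytes.
decoded = json.loads(wire)
restored = base64.b64decode(decoded["images"]["overhead"])
print(restored == jpeg_bytes)  # True
```

msgpack avoids the base64 step entirely by carrying raw bytes natively, which is one reason it is the better fit for this link.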

Demo

With JPEG Compression (~270ms latency)

Evaluation Demo - JPEG Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control

Without JPEG Compression (~600ms latency)

Evaluation Demo - Raw Side-by-side: Same task but with raw image transfer - 1.7 Hz control

Sample Evaluation

JPEG Compression (Fast)

Evaluation Composite - JPEG 5-frame composite: Start β†’ Approach β†’ Grasp β†’ Transport β†’ Final

Raw Images (Slow)

Evaluation Composite - Raw Same task without JPEG optimization

Usage

Server Setup (RunPod)

```bash
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
    --include "4999/**" \
    --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
    policy:checkpoint \
    --policy.config=pi05_so101 \
    --policy.dir=checkpoints/pi05_so101/4999
```

Client (Robot Mac)

```bash
# Install the client library
pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
```
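The two invocations above differ only in the `--no-jpeg` flag. A minimal sketch of how that flag handling might look inside `run_pi05.py` (the parser details are assumptions; only the flag names come from the commands above):

```python
import argparse

# Sketch of the command-line interface run_pi05.py appears to expose;
# the parser internals and help strings here are assumptions.
parser = argparse.ArgumentParser(description="SO-101 Pi0.5 inference client")
parser.add_argument("--server", required=True,
                    help="WebSocket policy server, e.g. wss://POD.proxy.runpod.net")
parser.add_argument("--no-jpeg", action="store_true",
                    help="send raw images instead of JPEG Q80 (slower)")

# Parse an example command line (normally argv would be used).
args = parser.parse_args(["--server", "wss://example-pod.proxy.runpod.net"])
use_jpeg = not args.no_jpeg
print(use_jpeg)  # True
```

With `use_jpeg` set, each camera frame would be JPEG-encoded before being attached to the request; with `--no-jpeg`, raw arrays go over the wire.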

External Videos (Phone Capture)

Real-world demonstrations recorded externally during evaluation runs:

JPEG Compression (~270ms)

External Video - JPEG External phone recording showing smooth robot control with JPEG compression

Raw Images (~600ms)

External Video - Raw Same task without compression - noticeably slower/choppier control

Edge Retrieval (Out-of-Distribution)

External Video - Edge Ball placed at workspace edge - a position that appeared in <10% of training episodes

Comparison with ACT Policy

Trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|---|---|---|---|---|
| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M params) | Local | ✅ | ❌ Center only |

Edge Retrieval: Pi0.5 vs ACT

ACT failed at edge positions. Its policy saw only the 72 training episodes, in which the ball sat mostly in the center of the reachable workspace; when the ball was placed at the workspace edge, ACT missed it or failed to reach it entirely.

Pi0.5 succeeds at edge positions despite having the same training data. This demonstrates the power of VLA pre-training:

  1. SigLIP (vision encoder) was pre-trained on billions of images - understands "ball" and "edge" concepts generally
  2. Gemma (language model) provides semantic grounding - "pick up ball" applies regardless of position
  3. Action Expert learned smooth motion primitives from diverse robot arms during base model training

The base Pi0.5 model was trained on data from many different robot arms performing various tasks. This gives it a strong prior on reachable workspace and arm kinematics that ACT (trained from scratch) simply doesn't have.

Infrastructure Notes

Remote Inference Setup:

  • Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
  • Client: Mac Mini M4 controlling SO-101 robot
  • Protocol: WebSocket with msgpack serialization
  • Optimization: JPEG compression reduces 1.8 MB → 71 KB per inference
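The per-frame compression step is straightforward: each camera frame is encoded to JPEG at quality 80 before serialization. A Pillow-based sketch (the actual implementation in the OpenPi fork may use a different image library):

```python
import io

import numpy as np
from PIL import Image

def encode_jpeg(frame: np.ndarray, quality: int = 80) -> bytes:
    """Compress an HxWx3 uint8 camera frame to JPEG at the given quality."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# A synthetic 480x640 RGB frame stands in for a real camera image;
# a mostly uniform image compresses far better than a natural scene would.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[:, :, 0] = 200

raw_size = frame.nbytes             # 921,600 bytes for one raw camera frame
jpeg_size = len(encode_jpeg(frame))
print(jpeg_size < raw_size)  # True
```

On real camera frames the ratio lands around the 25x reported above; synthetic uniform images compress even more aggressively, so this sketch only demonstrates the mechanism, not the exact ratio.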

Known Issues:

  • RTX 4090 is borderline for memory - occasional OOM during model loading
  • US datacenters preferred (2x faster than EU for network transfer)
  • First inference takes 30-60s (JAX JIT compilation)

Limitations

  • Requires GPU server for inference (not yet optimized for edge deployment)
  • Sensitive to lighting changes
  • 72 training episodes may limit extreme edge case handling

Citation

```bibtex
@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
```

Acknowledgments
