# SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned Pi0.5 (π0.5) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.
## Task Description

**Goal:** Pick up an orange ball from the table and place it into a pink cup.

- **Robot:** SO-101 - 6-DOF robot arm with gripper
- **Cameras:** Dual camera setup (overhead + wrist-mounted)
## Model Architecture
Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:
| Component | Description |
|---|---|
| Vision Encoder | SigLIP 400M - processes camera images |
| Language Model | Gemma 2B - scene understanding & task grounding |
| Action Expert | Flow Matching head - generates smooth action trajectories |
| Total Parameters | ~3B |
The model takes natural language instructions + camera images → outputs continuous joint actions.
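Concretely, the policy's contract can be sketched as below. The key names, image resolution, and action-chunk horizon are illustrative assumptions, not the actual OpenPi schema; the point is the shape of the mapping: two camera images + proprioceptive state + a prompt in, a short chunk of continuous joint actions out.

```python
import numpy as np

def mock_policy(observation: dict) -> np.ndarray:
    """Stand-in for the fine-tuned VLA: maps one observation to an action chunk."""
    assert observation["prompt"], "a task instruction is required"
    horizon, dof = 10, 6  # chunk length is illustrative; SO-101 exposes 6 joint channels
    return np.zeros((horizon, dof), dtype=np.float32)

observation = {
    "overhead_image": np.zeros((480, 640, 3), dtype=np.uint8),  # resolution assumed
    "wrist_image": np.zeros((480, 640, 3), dtype=np.uint8),
    "state": np.zeros(6, dtype=np.float32),  # current joint positions
    "prompt": "pick up the orange ball and place it in the pink cup",
}
actions = mock_policy(observation)  # shape (10, 6): a short trajectory, not a single step
```

Returning a chunk rather than a single action lets the robot execute several steps between (relatively slow) remote inferences.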
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Pi0.5 (Physical Intelligence) |
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Fine-tuning Steps | 5,000 |
| Hardware | A100 80GB on RunPod |
| Training Time | ~3-4 hours |
| Cost | ~$6-8 USD |
| Framework | OpenPi (JAX/Flax) |
## Inference Performance

### JPEG Compression Optimization

We implemented JPEG compression to reduce network transfer time for remote inference:
| Location | Raw Images | JPEG (Q80) | Speedup |
|---|---|---|---|
| EU Spot | 1448ms | 375ms | 3.9x |
| US On-Demand | 600ms | 270ms | 2.2x |

| Metric | Before | After |
|---|---|---|
| Payload Size | 1.8 MB | 71 KB |
| Control Rate (US) | 1.7 Hz | 3.7 Hz |
| Compression Ratio | - | 25x |
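The client-side encoding step is simple to reproduce. The sketch below (a minimal illustration, not the project's exact code) encodes a frame at quality 80 with Pillow; the 480×640 camera resolution is an assumption, chosen because two such frames total roughly the 1.8 MB raw payload above:

```python
import io

import numpy as np
from PIL import Image

def compress_frame(frame: np.ndarray, quality: int = 80) -> bytes:
    """JPEG-encode an RGB frame on the client; the server decodes before inference."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# A smooth synthetic gradient stands in for a real camera frame here.
ramp = np.linspace(0, 255, 640, dtype=np.uint8)
frame = np.broadcast_to(ramp, (480, 640))[..., None].repeat(3, axis=-1)

raw_per_step = 2 * frame.nbytes             # two cameras -> ~1.84 MB raw
jpeg_per_step = 2 * len(compress_frame(frame))
assert jpeg_per_step < raw_per_step         # real scenes reach ~25x at Q80
```

The compression ratio on real camera frames depends on scene content; the ~25x figure in the table is what we measured on this task's workspace.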
## Architecture

```
[RunPod GPU Server]              [Robot Mac]
+-----------------+            +----------------+
|  Pi0.5 Model    |<-- WSS --->|  run_pi05.py   |
|  (RTX 4090)     |    JPEG    |  (Robot ctrl)  |
+-----------------+            +----------------+
```
## Demo

### With JPEG Compression (~270ms latency)

Side-by-side: Overhead camera (left) + wrist camera (right) - smooth 3.7 Hz control

### Without JPEG Compression (~600ms latency)

Side-by-side: Same task but with raw image transfer - 1.7 Hz control
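The control rates quoted above follow directly from round-trip latency, assuming the client issues one inference per control step; a quick sanity check:

```python
def control_rate_hz(latency_ms: float) -> float:
    """Closed-loop rate is bounded by one inference round-trip."""
    return round(1000.0 / latency_ms, 1)

control_rate_hz(270)  # -> 3.7 Hz with JPEG compression
control_rate_hz(600)  # -> 1.7 Hz with raw image transfer
```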
## Sample Evaluation

### JPEG Compression (Fast)

5-frame composite: Start → Approach → Grasp → Transport → Final

### Raw Images (Slow)

Same task without JPEG optimization
## Usage

### Server Setup (RunPod)

```bash
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
  --include "4999/**" \
  --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
  policy:checkpoint \
  --policy.config=pi05_so101 \
  --policy.dir=checkpoints/pi05_so101/4999
```
### Client (Robot Mac)

```bash
pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
```
## External Videos (Phone Capture)

Real-world demonstrations recorded externally during evaluation runs:

### JPEG Compression (~270ms)

External phone recording showing smooth robot control with JPEG compression

### Raw Images (~600ms)

Same task without compression - noticeably slower, choppier control

### Edge Retrieval (Out-of-Distribution)

Ball placed at the workspace edge - a position that appeared in fewer than 10% of training episodes
## Comparison with ACT Policy

Trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|---|---|---|---|---|
| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M) | Local | ✅ | ❌ Center only |

### Edge Retrieval: Pi0.5 vs ACT

ACT failed at edge positions: the policy was trained on only 72 episodes, in which the ball sat mostly in the center of the reachable workspace. When the ball was placed at the edge of the workspace, ACT would miss it or fail to reach it entirely.
Pi0.5 succeeds at edge positions despite having the same training data. This demonstrates the power of VLA pre-training:
- SigLIP (vision encoder) was pre-trained on billions of images - understands "ball" and "edge" concepts generally
- Gemma (language model) provides semantic grounding - "pick up ball" applies regardless of position
- Action Expert learned smooth motion primitives from diverse robot arms during base model training
The base Pi0.5 model was trained on data from many different robot arms performing various tasks. This gives it a strong prior on reachable workspace and arm kinematics that ACT (trained from scratch) simply doesn't have.
## Infrastructure Notes

**Remote Inference Setup:**

- Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
- Client: Mac Mini M4 controlling SO-101 robot
- Protocol: WebSocket with msgpack serialization
- Optimization: JPEG compression reduces the per-inference payload from 1.8 MB → 71 KB
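The wire format can be sketched as follows: msgpack carries the JPEG-encoded images as raw binary blobs alongside state and prompt. This is a minimal illustration, not the exact OpenPi wire schema; the key names are assumptions.

```python
import msgpack  # pip install msgpack

def pack_observation(jpeg_overhead: bytes, jpeg_wrist: bytes,
                     state: list, prompt: str) -> bytes:
    """Serialize one observation; msgpack keeps JPEG bytes as binary, not text."""
    return msgpack.packb(
        {
            "overhead_jpeg": jpeg_overhead,  # illustrative key names
            "wrist_jpeg": jpeg_wrist,
            "state": state,
            "prompt": prompt,
        },
        use_bin_type=True,  # distinguish bytes from str on the wire
    )

# b"\xff\xd8" is the JPEG magic number; real payloads are full encoded frames.
payload = pack_observation(b"\xff\xd8", b"\xff\xd8", [0.0] * 6,
                           "pick up the orange ball")
decoded = msgpack.unpackb(payload, raw=False)  # what the server sees
```

msgpack's binary type is what makes this pairing efficient: the JPEG blobs cross the WebSocket without base64 inflation.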
**Known Issues:**
- RTX 4090 is borderline for memory - occasional OOM during model loading
- US datacenters preferred (2x faster than EU for network transfer)
- First inference takes 30-60s (JAX JIT compilation)
## Limitations
- Requires GPU server for inference (not yet optimized for edge deployment)
- Sensitive to lighting changes
- 72 training episodes may limit extreme edge case handling
## Citation

```bibtex
@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
```
## Acknowledgments
- Physical Intelligence for Pi0.5 and OpenPi
- LeRobot by Hugging Face
- SO-101 robot design community