SmolVLA Fine-tuned on SO-101 (Stratified Split)

87.66% success rate on the SO-101 pick-and-place task through a data-centric approach.

Model Description

This is a fine-tuned version of SmolVLA trained on the SO-101 pick-and-place dataset.

Key Achievement: Improved the success rate from 60.92% to 87.66% (+26.74 percentage points, a 44% relative gain) by switching to position-aware stratified data splitting rather than tuning hyperparameters.

Model Details

  • Model type: Vision-Language-Action (VLA) policy
  • Base model: lerobot/smolvla_base
  • Training data: 40 episodes (stratified across 5 cube positions)
  • Validation performance: 87.66% success rate (within 5% tolerance per joint)
  • Training/validation gap: 1.7x (healthy generalization)

Intended Use

This model is designed for:

  • Primary use: SO-101 robotic arm pick-and-place tasks
  • Research: Studying data-centric approaches in robot learning
  • Education: Understanding the impact of proper data splitting

Out-of-scope: This model is fine-tuned specifically for SO-101 pick-and-place. Performance on other tasks or robots is not guaranteed.

Performance

Success Rate (Within 5% of Joint Range)

Joint            Success Rate
shoulder_pan     98.13%
shoulder_lift    82.91%
elbow_flex       83.86%
wrist_flex       80.38%
wrist_roll       95.69%
gripper          84.99%
Average          87.66%
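
The per-joint metric above can be sketched as follows. This is a minimal illustration, not the evaluation script used for this model: the function name `per_joint_success`, the `joint_ranges` argument, and the reading of "5% of joint range" as `0.05 * (max - min)` are all assumptions. The last lines only re-derive the reported average from the per-joint figures in the table.

```python
from typing import Dict, List, Sequence, Tuple

JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
          "wrist_flex", "wrist_roll", "gripper"]


def per_joint_success(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    joint_ranges: Sequence[Tuple[float, float]],
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Percent of timesteps where each joint's absolute error is within
    tolerance * (joint_max - joint_min). Hypothetical helper; the range
    definition is an assumption."""
    rates: Dict[str, float] = {}
    for j, name in enumerate(JOINTS):
        low, high = joint_ranges[j]
        threshold = tolerance * (high - low)
        hits = sum(1 for p, t in zip(predicted, target)
                   if abs(p[j] - t[j]) <= threshold)
        rates[name] = 100.0 * hits / len(target)
    return rates


# Per-joint figures reported in the table above.
reported: Dict[str, float] = {
    "shoulder_pan": 98.13, "shoulder_lift": 82.91, "elbow_flex": 83.86,
    "wrist_flex": 80.38, "wrist_roll": 95.69, "gripper": 84.99,
}
average = sum(reported.values()) / len(reported)
print(f"{average:.2f}%")  # 87.66%
```

The average is an unweighted mean over the six joints, so a single weak joint (here wrist_flex) pulls the headline number down by only a sixth of its deficit.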

Comparison to Initial Approach

Metric           Sequential Split    Stratified Split    Relative Change
Success Rate     60.92%              87.66%              +44%
Train/Val Gap    5.0x                1.7x                -66%
shoulder_pan     45.81%              98.13%              +114%
wrist_roll       60.86%              95.69%              +57%
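
The percentages in the last column are relative changes, not percentage-point differences. A short sketch of the arithmetic, using only the figures from the comparison above (the helper name `relative_change` is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (sequential split, stratified split) values from the comparison above.
rows = {
    "Success Rate": (60.92, 87.66),
    "Train/Val Gap": (5.0, 1.7),
    "shoulder_pan": (45.81, 98.13),
    "wrist_roll": (60.86, 95.69),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")

# Absolute gain in success rate, in percentage points.
print(f"{87.66 - 60.92:.2f} points")
```

For the headline metric this gives roughly +44% relative, which corresponds to a gain of 26.74 percentage points.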

Training Details

Training Data

  • Dataset: lerobot/svla_so101_pickplace
  • Split strategy: Position-aware stratified sampling
    • 40 training episodes (8 per cube position)
    • 10 validation episodes (2 per cube position)
  • Total episodes: 50 (5 positions × 10 episodes each)

Training Procedure

Key Innovation: Stratified splitting by cube position instead of sequential episode splitting.

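# Stratified split: every one of the 5 cube positions appears in train AND val.
# Assumes episodes are ordered by cube position, 10 consecutive episodes each.
episodes = list(range(50))
train_episodes, val_episodes = [], []
for position in range(5):
    position_episodes = episodes[position * 10:(position + 1) * 10]
    # Which episodes go to each side is an illustrative choice (first 8 / last 2).
    train_episodes += position_episodes[:8]  # 8 episodes per position (80%)
    val_episodes += position_episodes[8:]    # 2 episodes per position (20%)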

Training hyperparameters:

  • Base model: lerobot/smolvla_base (pretrained)
  • Training steps: 15,000
  • Batch size: 24
  • Weight decay: 0.001
  • Best checkpoint: Step 6000 (validation loss: 0.0360)

Data augmentation:

  • Color jitter (brightness, contrast, saturation, hue)
  • Training: Standard variance (±20%)
  • Validation: Shifted distribution (darker/higher contrast) for robustness testing

How to Use

Installation

pip install lerobot

Inference

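# These import paths match older lerobot releases; newer releases drop the
# `common` segment (e.g. lerobot.policies.smolvla.modeling_smolvla).
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import torch

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
policy = SmolVLAPolicy.from_pretrained("Sa74ll/smolvla_so101_pickandplace")
policy.eval()
policy.to(device)

# Load dataset for testing
dataset = LeRobotDataset("lerobot/svla_so101_pickplace")

# Get observation
obs = dataset[0]

# Prepare input: add a batch dimension and move tensors to the model's device
batch = {
    'observation.images.camera1': obs['observation.images.up'].unsqueeze(0).to(device),
    'observation.images.camera2': obs['observation.images.side'].unsqueeze(0).to(device),
    'observation.state': obs['observation.state'].unsqueeze(0).to(device),
    'task': [obs['task']],  # language instruction the policy is conditioned on
}

# Predict actions
with torch.no_grad():
    actions = policy.predict_action_chunk(batch)  # (1, 50, 6)
    next_action = actions[0, 0, :]  # First action in sequence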

Limitations and Bias

Limitations

  1. Task-specific: Trained only on SO-101 pick-and-place. May not generalize to other manipulation tasks.
  2. Single robot: Fine-tuned for the SO-101 6-DOF arm. Performance on other robot embodiments is not guaranteed.
  3. Limited scenarios: 5 cube positions in training. Novel positions far from training distribution may have lower success.
  4. Offline-trained: Trained on recorded demonstrations and evaluated against recorded actions rather than live rollouts. Real-world deployment may require additional adaptation.

Bias Considerations

  • Position bias: The stratified split gives each of the 5 cube positions 8 training episodes, so coverage is balanced across those positions, but the model has seen no other cube locations.
  • Lighting conditions: Trained with specific augmentation ranges. Extreme lighting outside this range may degrade performance.

Key Lesson: Data Quality Over Model Tuning

This model demonstrates that proper data handling can have a larger impact than hyperparameter optimization.

What didn't work:

  • Increasing weight decay (0.001 to 0.2): ~5% improvement
  • Adjusting augmentation: minimal impact
  • Training longer: made validation worse

What worked:

  • Stratified data splitting: 44% relative improvement (60.92% to 87.66% success rate)

The initial 5x train/val gap was not overfitting; it was an unfair evaluation on underrepresented data (the validation set contained only position 4, which had few training examples).
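
The difference in position coverage between the two splits can be checked with a few lines. This is a sketch under an assumed layout: 50 episodes ordered by cube position in blocks of 10, with the sequential split taking the last 10 episodes as validation. Under that exact layout position 4 would have no training episodes at all, so the real dataset ordering may differ slightly from this assumption.

```python
NUM_POSITIONS = 5
EPISODES_PER_POSITION = 10  # assumed block layout, see lead-in


def position_of(episode: int) -> int:
    """Cube position of an episode under the assumed block ordering."""
    return episode // EPISODES_PER_POSITION


episodes = list(range(NUM_POSITIONS * EPISODES_PER_POSITION))

# Sequential split: first 40 episodes train, last 10 validate.
sequential_train = episodes[:40]
sequential_val = episodes[40:]

# Stratified split: last 2 episodes of every position validate.
stratified_val = [ep for ep in episodes if ep % EPISODES_PER_POSITION >= 8]
stratified_train = [ep for ep in episodes if ep % EPISODES_PER_POSITION < 8]

print(sorted({position_of(ep) for ep in sequential_train}))  # [0, 1, 2, 3]
print(sorted({position_of(ep) for ep in sequential_val}))    # [4]
print(sorted({position_of(ep) for ep in stratified_train}))  # [0, 1, 2, 3, 4]
print(sorted({position_of(ep) for ep in stratified_val}))    # [0, 1, 2, 3, 4]
```

With the sequential split, validation measures performance on a position the model has barely seen, which is a distribution shift rather than a sign of overfitting.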

More Information

  • Format: Safetensors
  • Model size: 0.5B params
  • Tensor types: F32, BF16