# ACT Model for SO101 Robot
This is an Action Chunking Transformer (ACT) model trained for the SO101 robot using LeRobot. The model was trained on demonstration data collected from teleoperation sessions.
## Model Details

### Architecture
- Model Type: Action Chunking Transformer (ACT)
- Vision Backbone: ResNet18 with ImageNet pretrained weights
- Transformer Configuration:
  - Hidden dimension: 512
  - Number of heads: 8
  - Encoder layers: 4
  - Decoder layers: 1
  - Feedforward dimension: 3200
- VAE: Enabled with 32-dimensional latent space
- Chunk Size: 50 steps
- Action Steps: 15 steps per inference
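The chunking scheme above means the policy predicts a 50-step action chunk on each forward pass, but only the first 15 actions are executed before the policy is queried again. A minimal sketch of that schedule (plain Python, illustrative only):

```python
# Illustrative sketch of action chunking: predict CHUNK_SIZE actions per
# forward pass, execute N_ACTION_STEPS of them, then re-run the policy.
CHUNK_SIZE = 50      # actions predicted per inference
N_ACTION_STEPS = 15  # actions executed before re-planning

def inference_calls(episode_len: int) -> int:
    """Number of policy forward passes needed to control an episode."""
    calls = 0
    t = 0
    while t < episode_len:
        calls += 1           # one forward pass yields a 50-step chunk
        t += N_ACTION_STEPS  # but only 15 of those steps are executed
    return calls

# A 300-step episode needs ceil(300 / 15) = 20 forward passes, even though
# 20 * 50 = 1000 actions were predicted in total.
print(inference_calls(300))  # -> 20
```

Executing fewer steps than are predicted lets the policy re-plan frequently while still benefiting from the smoother, longer-horizon chunk predictions.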
### Camera Setup
The model uses a dual-camera setup for robust perception:
**Wrist Camera** (`observation.images.wrist`):
- Resolution: 240×320 pixels
- Position: Mounted on the robot's wrist
- Purpose: Provides close-up, detailed view of manipulation tasks
- Field of view: Narrow, focused on the immediate workspace
**Top Camera** (`observation.images.top`):
- Resolution: 480×640 pixels
- Position: Mounted above the workspace
- Purpose: Provides broader context and overview of the environment
- Field of view: Wide, captures the entire workspace
### Input/Output Specifications

**Inputs:**
- Robot State: 6-dimensional joint positions
  - `shoulder_pan.pos`
  - `shoulder_lift.pos`
  - `elbow_flex.pos`
  - `wrist_flex.pos`
  - `wrist_roll.pos`
  - `gripper.pos`
- Wrist Camera: RGB image (240×320×3)
- Top Camera: RGB image (480×640×3)
**Outputs:**
- Actions: 6-dimensional joint commands (same structure as state)
## Training Details

### Dataset
- Source: `r2owb0/so101-DS1`
- Episodes: 10 demonstration episodes
- Total Frames: 5,990 frames
- Frame Rate: 30 FPS
- Robot Type: SO101 follower robot
### Training Configuration
- Training Steps: 25,000
- Batch Size: 4
- Learning Rate: 1e-5
- Optimizer: AdamW with weight decay 1e-4
- Validation Split: 10% of episodes
- Seed: 1000
### Data Augmentation
The model was trained with comprehensive image augmentation:
- Brightness adjustment (0.8-1.2x)
- Contrast adjustment (0.8-1.2x)
- Saturation adjustment (0.5-1.5x)
- Hue adjustment (±0.05)
- Sharpness adjustment (0.5-1.5x)
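These ranges correspond to a color-jitter-style transform applied per frame (e.g. torchvision's `ColorJitter` covers the first four, with sharpness typically handled by a separate transform). The sketch below only illustrates how a random factor would be sampled from each listed range; it is not LeRobot's actual augmentation pipeline.

```python
import random

# Augmentation ranges copied from the list above. This is an illustrative
# re-implementation of the factor sampling, not the real transform code.
JITTER_RANGES = {
    "brightness": (0.8, 1.2),   # multiplicative
    "contrast": (0.8, 1.2),     # multiplicative
    "saturation": (0.5, 1.5),   # multiplicative
    "hue": (-0.05, 0.05),       # additive shift
    "sharpness": (0.5, 1.5),    # multiplicative
}

def sample_jitter_factors(rng: random.Random) -> dict:
    """Draw one random factor per augmentation, uniform over its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in JITTER_RANGES.items()}

factors = sample_jitter_factors(random.Random(0))
```

Sampling a fresh factor for every frame exposes the policy to lighting and color variation it will encounter at deployment time, which is why the Limitations section still flags lighting sensitivity: the augmentation widens but does not eliminate that gap.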
## Usage

### Installation

```bash
pip install lerobot
```
### Loading the Model

```python
from lerobot.policies import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("r2owb0/act1")
```
### Evaluation

```bash
lerobot-eval \
  --policy.path=r2owb0/act1 \
  --env.type=your_env_type \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```
### Inference

```python
import torch

# Prepare the observation dictionary expected by the policy
observation = {
    "observation.state": torch.tensor([...]),        # 6-D robot state
    "observation.images.wrist": torch.tensor([...]), # 240×320×3 RGB image
    "observation.images.top": torch.tensor([...]),   # 480×640×3 RGB image
}

# Query the next action without tracking gradients
with torch.no_grad():
    action = policy.select_action(observation)
```
## Hardware Requirements

### Robot Setup
- Robot: SO101 follower robot
- Cameras:
- Wrist-mounted camera (240×320 resolution)
- Top-mounted camera (480×640 resolution)
- Control: 6-DOF arm with gripper
### Computing Requirements
- GPU: CUDA-compatible GPU recommended
- Memory: At least 4GB GPU memory
- Storage: ~200MB for model weights
## Performance Notes
- The model uses action chunking, predicting 50 steps ahead but executing 15 steps at a time
- Temporal ensembling is disabled for real-time inference
- The model expects normalized inputs (mean/std normalization)
- VAE is enabled for better representation learning
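The mean/std normalization mentioned above can be sketched as follows. LeRobot applies this internally using dataset statistics stored alongside the policy; the numbers below are made up for demonstration.

```python
# Illustrative mean/std normalization, as applied to states and actions.
# The statistics here are hypothetical, not the real dataset values.

def normalize(x, mean, std):
    """Map raw values to zero mean and unit variance under dataset stats."""
    return [(xi - mi) / si for xi, mi, si in zip(x, mean, std)]

def unnormalize(z, mean, std):
    """Invert normalization, e.g. to map predicted actions back to joint units."""
    return [zi * si + mi for zi, mi, si in zip(z, mean, std)]

state = [0.10, -0.20, 0.30, 0.00, 0.50, 0.90]  # hypothetical 6-D joint state
mean = [0.05, 0.00, 0.25, 0.10, 0.40, 0.50]    # hypothetical dataset means
std = [0.20, 0.30, 0.15, 0.25, 0.10, 0.35]     # hypothetical dataset stds

z = normalize(state, mean, std)          # what the policy sees as input
restored = unnormalize(z, mean, std)     # round-trips back to raw units
```

Feeding the policy raw, unnormalized joint positions or pixel values will silently degrade predictions, so any custom inference loop must apply the same statistics used during training.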
## Limitations
- Trained on a specific robot configuration (SO101)
- Requires the exact camera setup described above
- Performance may vary with different lighting conditions
- Limited to the task domain covered in the training dataset
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}
```
## License
This model is licensed under the Apache 2.0 License.