# ACT Model for SO101 Robot
This is an Action Chunking Transformer (ACT) model trained for the SO101 robot using LeRobot. The model was trained on demonstration data collected from teleoperation sessions.
## Model Details

### Architecture
- Model Type: Action Chunking Transformer (ACT)
- Vision Backbone: ResNet18 with ImageNet pretrained weights
- Transformer Configuration:
  - Hidden dimension: 512
  - Number of heads: 8
  - Encoder layers: 4
  - Decoder layers: 1
  - Feedforward dimension: 3200
- VAE: Enabled with 32-dimensional latent space
- Chunk Size: 50 steps
- Action Steps: 15 steps per inference
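The chunking scheme above means the policy predicts a 50-step action chunk on each forward pass, but only the first 15 actions are executed before the policy is queried again. A minimal sketch of that schedule (plain Python, illustrative only):

```python
# Illustrative sketch of action chunking: predict CHUNK_SIZE actions per
# forward pass, execute N_ACTION_STEPS of them, then re-run the policy.
CHUNK_SIZE = 50      # actions predicted per inference
N_ACTION_STEPS = 15  # actions executed before re-planning

def inference_calls(episode_len: int) -> int:
    """Number of policy forward passes needed to control an episode."""
    calls = 0
    t = 0
    while t < episode_len:
        calls += 1           # one forward pass yields a 50-step chunk
        t += N_ACTION_STEPS  # but only 15 of those steps are executed
    return calls

# A 300-step episode needs ceil(300 / 15) = 20 forward passes, even though
# 20 * 50 = 1000 actions were predicted in total.
print(inference_calls(300))  # -> 20
```

Executing fewer steps than are predicted lets the policy re-plan frequently while still benefiting from the smoother, longer-horizon chunk predictions.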
### Camera Setup
The model uses a dual-camera setup for robust perception:
**Wrist Camera** (`observation.images.wrist`):
- Resolution: 240×320 pixels
- Position: Mounted on the robot's wrist
- Purpose: Provides close-up, detailed view of manipulation tasks
- Field of view: Narrow, focused on the immediate workspace
**Top Camera** (`observation.images.top`):
- Resolution: 480×640 pixels
- Position: Mounted above the workspace
- Purpose: Provides broader context and overview of the environment
- Field of view: Wide, captures the entire workspace
### Input/Output Specifications

**Inputs:**
- Robot State: 6-dimensional joint positions
  - `shoulder_pan.pos`
  - `shoulder_lift.pos`
  - `elbow_flex.pos`
  - `wrist_flex.pos`
  - `wrist_roll.pos`
  - `gripper.pos`
- Wrist Camera: RGB image (240×320×3)
- Top Camera: RGB image (480×640×3)
**Outputs:**
- Actions: 6-dimensional joint commands (same structure as state)
## Training Details

### Dataset
- Source: `r2owb0/so101-DS1`
- Episodes: 10 demonstration episodes
- Total Frames: 5,990 frames
- Frame Rate: 30 FPS
- Robot Type: SO101 follower robot
### Training Configuration
- Training Steps: 25,000
- Batch Size: 4
- Learning Rate: 1e-5
- Optimizer: AdamW with weight decay 1e-4
- Validation Split: 10% of episodes
- Seed: 1000
### Data Augmentation
The model was trained with comprehensive image augmentation:
- Brightness adjustment (0.8-1.2x)
- Contrast adjustment (0.8-1.2x)
- Saturation adjustment (0.5-1.5x)
- Hue adjustment (±0.05)
- Sharpness adjustment (0.5-1.5x)
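These ranges correspond to a color-jitter-style transform applied per frame (e.g. torchvision's `ColorJitter` covers the first four, with sharpness typically handled by a separate transform). The sketch below only illustrates how a random factor would be sampled from each listed range; it is not LeRobot's actual augmentation pipeline.

```python
import random

# Augmentation ranges copied from the list above. This is an illustrative
# re-implementation of the factor sampling, not the real transform code.
JITTER_RANGES = {
    "brightness": (0.8, 1.2),   # multiplicative
    "contrast": (0.8, 1.2),     # multiplicative
    "saturation": (0.5, 1.5),   # multiplicative
    "hue": (-0.05, 0.05),       # additive shift
    "sharpness": (0.5, 1.5),    # multiplicative
}

def sample_jitter_factors(rng: random.Random) -> dict:
    """Draw one random factor per augmentation, uniform over its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in JITTER_RANGES.items()}

factors = sample_jitter_factors(random.Random(0))
```

Sampling a fresh factor for every frame exposes the policy to lighting and color variation it will encounter at deployment time, which is why the Limitations section still flags lighting sensitivity: the augmentation widens but does not eliminate that gap.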
## Usage

### Installation

```bash
pip install lerobot
```
### Loading the Model

```python
from lerobot.policies import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("r2owb0/act1")
```
### Evaluation

```bash
lerobot-eval \
  --policy.path=r2owb0/act1 \
  --env.type=your_env_type \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```
### Inference

```python
import torch

# Prepare the observation dictionary expected by the policy
observation = {
    "observation.state": torch.tensor([...]),        # 6-D robot state
    "observation.images.wrist": torch.tensor([...]), # 240×320×3 RGB image
    "observation.images.top": torch.tensor([...]),   # 480×640×3 RGB image
}

# Query the next action without tracking gradients
with torch.no_grad():
    action = policy.select_action(observation)
```
## Hardware Requirements

### Robot Setup
- Robot: SO101 follower robot
- Cameras:
- Wrist-mounted camera (240×320 resolution)
- Top-mounted camera (480×640 resolution)
- Control: 6-DOF arm with gripper
### Computing Requirements
- GPU: CUDA-compatible GPU recommended
- Memory: At least 4GB GPU memory
- Storage: ~200MB for model weights
## Performance Notes
- The model uses action chunking, predicting 50 steps ahead but executing 15 steps at a time
- Temporal ensembling is disabled for real-time inference
- The model expects normalized inputs (mean/std normalization)
- VAE is enabled for better representation learning
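The mean/std normalization mentioned above can be sketched as follows. LeRobot applies this internally using dataset statistics stored alongside the policy; the numbers below are made up for demonstration.

```python
# Illustrative mean/std normalization, as applied to states and actions.
# The statistics here are hypothetical, not the real dataset values.

def normalize(x, mean, std):
    """Map raw values to zero mean and unit variance under dataset stats."""
    return [(xi - mi) / si for xi, mi, si in zip(x, mean, std)]

def unnormalize(z, mean, std):
    """Invert normalization, e.g. to map predicted actions back to joint units."""
    return [zi * si + mi for zi, mi, si in zip(z, mean, std)]

state = [0.10, -0.20, 0.30, 0.00, 0.50, 0.90]  # hypothetical 6-D joint state
mean = [0.05, 0.00, 0.25, 0.10, 0.40, 0.50]    # hypothetical dataset means
std = [0.20, 0.30, 0.15, 0.25, 0.10, 0.35]     # hypothetical dataset stds

z = normalize(state, mean, std)          # what the policy sees as input
restored = unnormalize(z, mean, std)     # round-trips back to raw units
```

Feeding the policy raw, unnormalized joint positions or pixel values will silently degrade predictions, so any custom inference loop must apply the same statistics used during training.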
## Limitations
- Trained on a specific robot configuration (SO101)
- Requires the exact camera setup described above
- Performance may vary with different lighting conditions
- Limited to the task domain covered in the training dataset
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}
```
## License
This model is licensed under the Apache 2.0 License.