ACT Model for SO101 Robot

This is an Action Chunking Transformer (ACT) model trained for the SO101 robot using LeRobot. The model was trained on demonstration data collected from teleoperation sessions.

Model Details

Architecture

  • Model Type: Action Chunking Transformer (ACT)
  • Vision Backbone: ResNet18 with ImageNet pretrained weights
  • Transformer Configuration:
    • Hidden dimension: 512
    • Number of heads: 8
    • Encoder layers: 4
    • Decoder layers: 1
    • Feedforward dimension: 3200
  • VAE: Enabled with 32-dimensional latent space
  • Chunk Size: 50 steps
  • Action Steps: 15 of the 50 predicted actions are executed per inference call
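Assuming the robot is controlled at the 30 FPS frame rate of the training dataset, the chunking parameters translate into time as follows. This is plain arithmetic for orientation, not code taken from LeRobot.

```python
FPS = 30              # frame rate of the training dataset (assumed control rate)
CHUNK_SIZE = 50       # actions predicted per forward pass
N_ACTION_STEPS = 15   # actions executed before the policy is queried again

chunk_horizon_s = CHUNK_SIZE / FPS          # how far ahead each prediction looks
executed_s = N_ACTION_STEPS / FPS           # how long each prediction is actually used
inference_rate_hz = FPS / N_ACTION_STEPS    # forward passes per second

print(f"horizon={chunk_horizon_s:.2f}s executed={executed_s:.2f}s rate={inference_rate_hz:.1f}Hz")
```

Under these assumptions each chunk covers about 1.67 s, only the first 0.5 s of it is executed, and the network runs about twice per second, discarding the remaining 35 predicted actions of each chunk.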

Camera Setup

The model uses a dual-camera setup for robust perception:

  1. Wrist Camera (observation.images.wrist):

    • Resolution: 240×320 pixels
    • Position: Mounted on the robot's wrist
    • Purpose: Provides a close-up, detailed view of the manipulation
    • Field of view: Narrow, focused on the immediate workspace
  2. Top Camera (observation.images.top):

    • Resolution: 480×640 pixels
    • Position: Mounted above the workspace
    • Purpose: Provides broader context and overview of the environment
    • Field of view: Wide, captures the entire workspace

Input/Output Specifications

Inputs:

  • Robot State: 6-dimensional joint positions
    • shoulder_pan.pos
    • shoulder_lift.pos
    • elbow_flex.pos
    • wrist_flex.pos
    • wrist_roll.pos
    • gripper.pos
  • Wrist Camera: RGB image (240×320×3, height × width × channels)
  • Top Camera: RGB image (480×640×3, height × width × channels)

Outputs:

  • Actions: 6-dimensional joint commands (same structure as state)
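A quick way to catch camera or state mismatches before running the policy is to compare observation shapes against the specification above. The helper check_observation_shapes and the EXPECTED_SHAPES table are hypothetical, written only for illustration; shapes follow the height × width × channels convention listed here, whereas the tensors actually passed to the policy may use a different layout (for example channel-first with a batch dimension).

```python
# Hypothetical table built from the Input specification above (H × W × C for images)
EXPECTED_SHAPES = {
    "observation.state": (6,),
    "observation.images.wrist": (240, 320, 3),
    "observation.images.top": (480, 640, 3),
}

def check_observation_shapes(shapes: dict[str, tuple[int, ...]]) -> list[str]:
    """Return human-readable problems; an empty list means every expected key matches."""
    problems = []
    for key, expected in EXPECTED_SHAPES.items():
        if key not in shapes:
            problems.append(f"missing key: {key}")
        elif tuple(shapes[key]) != expected:
            problems.append(f"{key}: expected {expected}, got {tuple(shapes[key])}")
    return problems

# Example: the top camera was accidentally captured at 720p
print(check_observation_shapes({
    "observation.state": (6,),
    "observation.images.wrist": (240, 320, 3),
    "observation.images.top": (720, 1280, 3),
}))
```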

Training Details

Dataset

  • Source: r2owb0/so101-DS1
  • Episodes: 10 demonstration episodes
  • Total Frames: 5,990 frames
  • Frame Rate: 30 FPS
  • Robot Type: SO101 follower robot
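The dataset figures above imply roughly three minutes of demonstrations in total. The per-episode numbers below are averages derived from the totals; individual episode lengths are not listed on this card.

```python
TOTAL_FRAMES = 5_990
N_EPISODES = 10
FPS = 30

total_seconds = TOTAL_FRAMES / FPS                 # ≈ 199.7 s of demonstrations
mean_episode_frames = TOTAL_FRAMES / N_EPISODES    # 599 frames per episode on average
mean_episode_seconds = mean_episode_frames / FPS   # ≈ 20 s per episode on average

print(f"{total_seconds / 60:.1f} min total, {mean_episode_seconds:.1f} s per episode on average")
```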

Training Configuration

  • Training Steps: 25,000
  • Batch Size: 4
  • Learning Rate: 1e-5
  • Optimizer: AdamW with weight decay 1e-4
  • Validation Split: 10% of episodes
  • Seed: 1000
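For a sense of scale, the configuration above means each frame is revisited many times during training. The sketch estimates the number of passes over the data and shows one way a 10% episode-level split with seed 1000 could be drawn; split_episodes is hypothetical, and the exact split used for this model is not documented here.

```python
import random

TRAINING_STEPS = 25_000
BATCH_SIZE = 4
TOTAL_FRAMES = 5_990
N_EPISODES = 10

samples_seen = TRAINING_STEPS * BATCH_SIZE     # 100,000 sampled frames
approx_passes = samples_seen / TOTAL_FRAMES    # ≈ 16.7 passes if every frame were used for training

def split_episodes(n_episodes: int, val_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Hypothetical episode-level split: shuffle episode indices, hold out a fraction for validation."""
    indices = list(range(n_episodes))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_episodes * val_fraction))
    return sorted(indices[n_val:]), sorted(indices[:n_val])

train_eps, val_eps = split_episodes(N_EPISODES, val_fraction=0.1, seed=1000)
print(samples_seen, round(approx_passes, 1), len(train_eps), len(val_eps))
```

Holding out one of the ten episodes leaves fewer training frames, so the true number of passes is somewhat higher than this estimate. Splitting by episode rather than by frame keeps nearly identical neighboring frames from landing in both the training and validation sets.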

Data Augmentation

The model was trained with the following image augmentations:

  • Brightness adjustment (0.8-1.2x)
  • Contrast adjustment (0.8-1.2x)
  • Saturation adjustment (0.5-1.5x)
  • Hue adjustment (±0.05)
  • Sharpness adjustment (0.5-1.5x)
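The ranges above can be read as uniform sampling intervals for per-image factors. The sketch below only draws the parameters in pure Python and does not touch any pixels; AUGMENTATION_RANGES and sample_augmentation are hypothetical names, the hue units are assumed to follow the torchvision convention, and LeRobot's own image-transform pipeline may apply only a random subset of these transforms to each sample rather than all of them.

```python
import random

# (low, high) intervals taken from the list above; a factor of 1.0 (or a hue shift of 0.0) leaves the image unchanged
AUGMENTATION_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),      # additive hue shift, assumed to be a fraction of the full hue circle
    "sharpness": (0.5, 1.5),
}

def sample_augmentation(rng: random.Random) -> dict[str, float]:
    """Draw one factor per transform uniformly from its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in AUGMENTATION_RANGES.items()}

params = sample_augmentation(random.Random(0))
print({name: round(value, 3) for name, value in params.items()})
```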

Usage

Installation

pip install lerobot

Loading the Model

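# In older LeRobot releases this module is lerobot.common.policies.act.modeling_act
from lerobot.policies.act.modeling_act import ACTPolicy

# Download the config and weights from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("r2owb0/act1")
policy.eval()  # inference mode: disables dropout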

Evaluation

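lerobot-eval \
    --policy.path=r2owb0/act1 \
    --env.type=your_env_type \
    --eval.n_episodes=10 \
    --eval.batch_size=10

Here your_env_type is a placeholder. lerobot-eval runs rollouts in a simulated environment, so it applies only if a simulation matching this robot and camera setup is available; on the physical SO101, recent LeRobot versions run trained policies through lerobot-record with a --policy.path argument.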

Inference

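import torch

# Assumes `policy` was loaded as shown above; call reset() at the start of every episode
policy.reset()
device = next(policy.parameters()).device

# Prepare observation: batch dimension first, images channel-first float32 in [0, 1].
# The zeros are placeholders; substitute real joint readings and camera frames.
# Depending on the LeRobot version, mean/std normalization happens inside the policy
# or in a separate preprocessing step.
observation = {
    "observation.state": torch.zeros(1, 6),                   # 6D robot state
    "observation.images.wrist": torch.zeros(1, 3, 240, 320),  # wrist RGB
    "observation.images.top": torch.zeros(1, 3, 480, 640),    # top RGB
}
observation = {key: value.to(device) for key, value in observation.items()}

# Get action: one 6D joint command per call, taken from the policy's internal action queue
with torch.no_grad():
    action = policy.select_action(observation)  # shape (1, 6)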

Hardware Requirements

Robot Setup

  • Robot: SO101 follower robot
  • Cameras:
    • Wrist-mounted camera (240×320 resolution)
    • Top-mounted camera (480×640 resolution)
  • Control: 6-DOF arm with gripper

Computing Requirements

  • GPU: CUDA-compatible GPU recommended
  • Memory: At least 4GB GPU memory
  • Storage: ~200MB for model weights

Performance Notes

  • The model uses action chunking, predicting 50 steps ahead but executing 15 steps at a time
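  • Temporal ensembling is disabled, so the network is queried once every 15 steps instead of at every step, which keeps inference cheap enough for real-time control
  • Inputs and outputs are normalized with mean/std statistics computed from the training dataset
  • The VAE encoder is used only during training, to model variability across the human demonstrations; at inference the latent is set to zero, so predictions are deterministic for a given observation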
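The chunk-then-execute loop described above can be sketched as a simple action queue. predict_chunk is a hypothetical stand-in for the network's forward pass (in LeRobot, select_action manages a similar queue internally); the sketch counts how many forward passes a 600-step episode, about 20 s at 30 FPS, would need.

```python
from collections import deque

CHUNK_SIZE = 50
N_ACTION_STEPS = 15

def predict_chunk(step: int) -> list[int]:
    """Hypothetical policy call: returns CHUNK_SIZE placeholder actions labeled by the step they target."""
    return [step + i for i in range(CHUNK_SIZE)]

def run_episode(n_steps: int) -> tuple[list[int], int]:
    """Execute n_steps control steps, refilling the queue from a new chunk whenever it runs empty."""
    queue: deque[int] = deque()
    executed, n_calls = [], 0
    for step in range(n_steps):
        if not queue:
            chunk = predict_chunk(step)
            n_calls += 1
            queue.extend(chunk[:N_ACTION_STEPS])  # keep the first 15 actions; the other 35 are discarded
        executed.append(queue.popleft())
    return executed, n_calls

actions, calls = run_episode(600)
print(calls)  # 40 forward passes for 600 control steps
```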

Limitations

  • Trained on a specific robot configuration (SO101)
  • Requires the exact camera setup described above
  • Performance may vary with different lighting conditions
  • Limited to the task domain covered in the training dataset

Citation

If you use this model in your research, please cite:

@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}

License

This model is licensed under the Apache 2.0 License.
