---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- robotics
- lerobot
- act
- imitation-learning
- so101
model_name: act
datasets:
- r2owb0/so101-DS1
base_model: lerobot/smolvla_base
---

# ACT Model for SO101 Robot

This is an Action Chunking Transformer (ACT) policy for the SO101 robot, trained with LeRobot on demonstration data collected during teleoperation sessions.

## Model Details

### Architecture

- **Model Type**: Action Chunking Transformer (ACT)
- **Vision Backbone**: ResNet18 with ImageNet-pretrained weights
- **Transformer Configuration**:
  - Hidden dimension: 512
  - Number of heads: 8
  - Encoder layers: 4
  - Decoder layers: 1
  - Feedforward dimension: 3200
- **VAE**: Enabled, with a 32-dimensional latent space
- **Chunk Size**: 50 steps
- **Action Steps**: 15 steps per inference
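
For reference, these hyperparameters map onto LeRobot's `ACTConfig` roughly as follows. This is a sketch only; the import path and exact field names can differ between lerobot versions:

```python
# Sketch: reconstructing the architecture described above.
# Older lerobot releases expose ACTConfig under lerobot.common.policies.act.configuration_act.
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",  # ImageNet weights
    dim_model=512,        # hidden dimension
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    dim_feedforward=3200,
    use_vae=True,
    latent_dim=32,
    chunk_size=50,        # actions predicted per forward pass
    n_action_steps=15,    # actions executed before re-planning
)
```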

### Camera Setup

The model uses a **dual-camera setup** for robust perception (a capture sketch follows the list below):

1. **Wrist Camera** (`observation.images.wrist`):
   - Resolution: 240×320 pixels
   - Position: Mounted on the robot's wrist
   - Purpose: Provides a close-up, detailed view of manipulation tasks
   - Field of view: Narrow, focused on the immediate workspace

2. **Top Camera** (`observation.images.top`):
   - Resolution: 480×640 pixels
   - Position: Mounted above the workspace
   - Purpose: Provides broader context and an overview of the environment
   - Field of view: Wide, captures the entire workspace
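
If you are reproducing this camera setup outside LeRobot's own camera drivers, a plain OpenCV capture sketch might look like the following; the camera indices 0 and 1 are placeholders for whatever device IDs your wrist and top cameras get:

```python
import cv2

def open_camera(index: int, width: int, height: int) -> cv2.VideoCapture:
    """Open a camera and request the given frame size."""
    cap = cv2.VideoCapture(index)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    return cap

wrist_cam = open_camera(0, 320, 240)  # wrist camera: 240×320 (height × width)
top_cam = open_camera(1, 640, 480)    # top camera: 480×640 (height × width)

ok_wrist, wrist_bgr = wrist_cam.read()
ok_top, top_bgr = top_cam.read()

# The policy expects RGB images, so convert from OpenCV's BGR ordering.
wrist_rgb = cv2.cvtColor(wrist_bgr, cv2.COLOR_BGR2RGB)
top_rgb = cv2.cvtColor(top_bgr, cv2.COLOR_BGR2RGB)
```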

### Input/Output Specifications

**Inputs:**
- **Robot State**: 6-dimensional joint positions
  - `shoulder_pan.pos`
  - `shoulder_lift.pos`
  - `elbow_flex.pos`
  - `wrist_flex.pos`
  - `wrist_roll.pos`
  - `gripper.pos`
- **Wrist Camera**: RGB image (240×320×3)
- **Top Camera**: RGB image (480×640×3)

**Outputs:**
- **Actions**: 6-dimensional joint commands (same structure as the state)
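
As an illustration, a hypothetical helper (not part of LeRobot) for packing the named joint positions into the expected 6-D state tensor, in the order listed above:

```python
import torch

# Joint order of the 6-D state/action vector, as listed above.
JOINT_NAMES = [
    "shoulder_pan.pos", "shoulder_lift.pos", "elbow_flex.pos",
    "wrist_flex.pos", "wrist_roll.pos", "gripper.pos",
]

def pack_state(joint_positions: dict[str, float]) -> torch.Tensor:
    """Pack named joint positions into a (1, 6) float32 tensor (batch of one)."""
    return torch.tensor(
        [[joint_positions[name] for name in JOINT_NAMES]], dtype=torch.float32
    )
```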

## Training Details

### Dataset

- **Source**: `r2owb0/so101-DS1`
- **Episodes**: 10 demonstration episodes
- **Total Frames**: 5,990 frames
- **Frame Rate**: 30 FPS
- **Robot Type**: SO101 follower robot
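
You can sanity-check these numbers by loading the dataset with LeRobot. This is a sketch; the import path and attribute names can differ between lerobot versions:

```python
# Older releases: from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("r2owb0/so101-DS1")
print(dataset.num_episodes)  # expected: 10
print(dataset.num_frames)    # expected: 5990
print(dataset.fps)           # expected: 30
```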

### Training Configuration

- **Training Steps**: 25,000
- **Batch Size**: 4
- **Learning Rate**: 1e-5
- **Optimizer**: AdamW with weight decay 1e-4
- **Validation Split**: 10% of episodes
- **Seed**: 1000
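
In plain PyTorch terms, the optimizer settings above correspond to the following sketch (assuming `policy` is a loaded `ACTPolicy`; the actual run was driven by LeRobot's training script):

```python
import torch

# AdamW with the learning rate and weight decay listed above
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5, weight_decay=1e-4)
```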

### Data Augmentation

The model was trained with the following image augmentations:

- Brightness adjustment (0.8-1.2x)
- Contrast adjustment (0.8-1.2x)
- Saturation adjustment (0.5-1.5x)
- Hue adjustment (±0.05)
- Sharpness adjustment (0.5-1.5x)
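
These ranges correspond to standard color-jitter-style transforms. An approximate torchvision sketch is shown below; it is not the exact training pipeline, which was configured through LeRobot's image-transform settings:

```python
from torchvision.transforms import v2

# Approximate equivalent of the color augmentations listed above.
# The sharpness jitter over (0.5, 1.5) was applied through LeRobot's own
# image-transform configuration and is not reproduced here.
color_jitter = v2.ColorJitter(
    brightness=(0.8, 1.2),
    contrast=(0.8, 1.2),
    saturation=(0.5, 1.5),
    hue=(-0.05, 0.05),
)
```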

## Usage

### Installation

```bash
pip install lerobot
```

### Loading the Model

```python
from lerobot.policies import ACTPolicy

# Load the pretrained policy from the Hugging Face Hub
policy = ACTPolicy.from_pretrained("r2owb0/act1")
policy.eval()
```

### Evaluation

```bash
lerobot-eval \
  --policy.path=r2owb0/act1 \
  --env.type=your_env_type \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```

### Inference

```python
import torch

# Dummy observation with the expected shapes (replace with real sensor readings).
# Images are float tensors in [0, 1], channel-first, with a leading batch dimension.
observation = {
    "observation.state": torch.zeros(1, 6),                   # 6-D joint positions
    "observation.images.wrist": torch.zeros(1, 3, 240, 320),  # wrist camera RGB
    "observation.images.top": torch.zeros(1, 3, 480, 640),    # top camera RGB
}

# Get the next action from the policy loaded above
with torch.no_grad():
    action = policy.select_action(observation)  # shape: (1, 6)
```

## Hardware Requirements

### Robot Setup

- **Robot**: SO101 follower robot
- **Cameras**:
  - Wrist-mounted camera (240×320 resolution)
  - Top-mounted camera (480×640 resolution)
- **Control**: 6-DOF arm with gripper

### Computing Requirements

- **GPU**: CUDA-compatible GPU recommended
- **Memory**: At least 4 GB of GPU memory
- **Storage**: ~200 MB for model weights

## Performance Notes

- The model uses action chunking: it predicts 50 steps ahead but executes 15 steps at a time (see the control-loop sketch below)
- Temporal ensembling is disabled for real-time inference
- The model expects normalized inputs (mean/std normalization)
- VAE is enabled for better representation learning
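
A minimal control-loop sketch under this chunking scheme, assuming the `policy` and observation format from the Usage section; `get_observation` and `send_to_robot` are hypothetical placeholders for your robot interface:

```python
import time
import torch

policy.reset()  # clear the policy's internal action queue at the start of an episode

for _ in range(300):                      # e.g. a 10-second episode at 30 Hz
    observation = get_observation()       # hypothetical: returns the dict shown under "Inference"
    with torch.no_grad():
        # One action per call; the policy re-plans internally every 15 executed steps.
        action = policy.select_action(observation)
    send_to_robot(action)                 # hypothetical: sends the 6-D joint command
    time.sleep(1 / 30)                    # match the 30 FPS of the training data
```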

## Limitations

- Trained on a specific robot configuration (SO101)
- Requires the exact camera setup described above
- Performance may vary with different lighting conditions
- Limited to the task domain covered in the training dataset

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}
```

## License

This model is licensed under the Apache 2.0 License.