---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- robotics
- lerobot
- act
- imitation-learning
- so101
model_name: act
datasets:
- r2owb0/so101-DS1
base_model: lerobot/smolvla_base
---

# ACT Model for SO101 Robot

This is an Action Chunking Transformer (ACT) model trained for the SO101 robot using LeRobot. The model was trained on demonstration data collected from teleoperation sessions.

## Model Details

### Architecture
- **Model Type**: Action Chunking Transformer (ACT)
- **Vision Backbone**: ResNet18 with ImageNet pretrained weights
- **Transformer Configuration**:
  - Hidden dimension: 512
  - Number of heads: 8
  - Encoder layers: 4
  - Decoder layers: 1
  - Feedforward dimension: 3200
- **VAE**: Enabled with 32-dimensional latent space
- **Chunk Size**: 50 steps
- **Action Steps**: 15 steps per inference
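
For orientation, the hyperparameters above map roughly onto the fields of LeRobot's `ACTConfig`. The snippet below is an illustrative sketch only, not the exact training configuration; the import path may differ between LeRobot versions.

```python
# Illustrative sketch of the architecture listed above, not the exact training config.
# The import path may vary with your LeRobot version.
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",
    dim_model=512,
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    dim_feedforward=3200,
    use_vae=True,
    latent_dim=32,
    chunk_size=50,
    n_action_steps=15,
)
```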

### Camera Setup
The model uses a **dual-camera setup** for robust perception:

1. **Wrist Camera** (`observation.images.wrist`):
   - Resolution: 240×320 pixels
   - Position: Mounted on the robot's wrist
   - Purpose: Provides close-up, detailed view of manipulation tasks
   - Field of view: Narrow, focused on the immediate workspace

2. **Top Camera** (`observation.images.top`):
   - Resolution: 480×640 pixels  
   - Position: Mounted above the workspace
   - Purpose: Provides broader context and overview of the environment
   - Field of view: Wide, captures the entire workspace

### Input/Output Specifications

**Inputs:**
- **Robot State**: 6-dimensional joint positions
  - `shoulder_pan.pos`
  - `shoulder_lift.pos` 
  - `elbow_flex.pos`
  - `wrist_flex.pos`
  - `wrist_roll.pos`
  - `gripper.pos`
- **Wrist Camera**: RGB image (240×320×3)
- **Top Camera**: RGB image (480×640×3)

**Outputs:**
- **Actions**: 6-dimensional joint commands (same structure as state)
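
Camera frames usually arrive as H×W×3 `uint8` arrays, while LeRobot policies consume batched, channel-first float tensors in [0, 1]. The helper below is a minimal conversion sketch (the function name is ours, not part of LeRobot):

```python
import numpy as np
import torch

def to_policy_image(frame: np.ndarray) -> torch.Tensor:
    """Convert an HxWx3 uint8 RGB frame to a (1, 3, H, W) float tensor in [0, 1]."""
    tensor = torch.from_numpy(frame).float() / 255.0  # scale to [0, 1]
    return tensor.permute(2, 0, 1).unsqueeze(0)       # HWC -> CHW, add batch dim

# Example with the two camera streams described above (dummy frames)
wrist = to_policy_image(np.zeros((240, 320, 3), dtype=np.uint8))  # (1, 3, 240, 320)
top = to_policy_image(np.zeros((480, 640, 3), dtype=np.uint8))    # (1, 3, 480, 640)
state = torch.zeros(1, 6)  # the six joint positions, in the order listed above
```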

## Training Details

### Dataset
- **Source**: `r2owb0/so101-DS1`
- **Episodes**: 10 demonstration episodes
- **Total Frames**: 5,990 frames
- **Frame Rate**: 30 FPS
- **Robot Type**: SO101 follower robot

### Training Configuration
- **Training Steps**: 25,000
- **Batch Size**: 4
- **Learning Rate**: 1e-5
- **Optimizer**: AdamW with weight decay 1e-4
- **Validation Split**: 10% of episodes
- **Seed**: 1000
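
For reference, a run with this configuration might look roughly like the command below. The flag names assume LeRobot's `lerobot-train` CLI (check your installed version for the exact interface); `outputs/train/act_so101` is an arbitrary output directory.

```bash
lerobot-train \
    --policy.type=act \
    --dataset.repo_id=r2owb0/so101-DS1 \
    --batch_size=4 \
    --steps=25000 \
    --seed=1000 \
    --output_dir=outputs/train/act_so101
```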

### Data Augmentation
The model was trained with photometric image augmentation:
- Brightness adjustment (0.8-1.2x)
- Contrast adjustment (0.8-1.2x)
- Saturation adjustment (0.5-1.5x)
- Hue adjustment (±0.05)
- Sharpness adjustment (0.5-1.5x)
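
These are standard photometric jitters; the color transforms correspond roughly to a torchvision `ColorJitter`, sketched below. The sharpness jitter has no direct `ColorJitter` equivalent and is omitted here.

```python
from torchvision.transforms import v2

# Rough equivalent of the brightness/contrast/saturation/hue ranges above.
# Sharpness jitter (0.5-1.5x) is handled separately and not shown.
color_jitter = v2.ColorJitter(
    brightness=(0.8, 1.2),
    contrast=(0.8, 1.2),
    saturation=(0.5, 1.5),
    hue=(-0.05, 0.05),
)
```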

## Usage

### Installation
```bash
pip install lerobot
```

### Loading the Model
```python
from lerobot.policies import ACTPolicy

# Load the model
policy = ACTPolicy.from_pretrained("r2owb0/act1")
```
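
After loading, it is usually worth moving the policy to your device, switching to eval mode, and resetting its internal action queue before a rollout (a minimal sketch using the standard `torch.nn.Module` interface plus the policy's `reset()` method):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)   # standard nn.Module device placement
policy.eval()       # inference mode: disables dropout, etc.
policy.reset()      # clear the internal action queue before a new episode
```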

### Evaluation
```bash
lerobot-eval \
    --policy.path=r2owb0/act1 \
    --env.type=your_env_type \
    --eval.n_episodes=10 \
    --eval.batch_size=10
```

### Inference
```python
import torch

# Prepare a batched observation (batch size 1). LeRobot policies expect
# float32 tensors with images channel-first in [0, 1]; dummy zero tensors
# stand in for real sensor data here (see the conversion helper above).
observation = {
    "observation.state": torch.zeros(1, 6),                   # 6D robot state
    "observation.images.wrist": torch.zeros(1, 3, 240, 320),  # wrist RGB
    "observation.images.top": torch.zeros(1, 3, 480, 640),    # top RGB
}

# Get action
with torch.no_grad():
    action = policy.select_action(observation)
```

## Hardware Requirements

### Robot Setup
- **Robot**: SO101 follower robot
- **Cameras**: 
  - Wrist-mounted camera (240×320 resolution)
  - Top-mounted camera (480×640 resolution)
- **Control**: 6-DOF arm with gripper

### Computing Requirements
- **GPU**: CUDA-compatible GPU recommended
- **Memory**: At least 4GB GPU memory
- **Storage**: ~200MB for model weights

## Performance Notes

- The model uses action chunking, predicting 50 steps ahead but executing 15 steps at a time
- Temporal ensembling is disabled for real-time inference
- The model expects normalized inputs (mean/std normalization)
- VAE is enabled for better representation learning
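
The chunking behavior above can be pictured as a simple action queue: every 15 environment steps the policy predicts a fresh 50-step chunk and serves actions from it one at a time. The sketch below is schematic, not LeRobot's actual implementation.

```python
from collections import deque

class ChunkedController:
    """Schematic of ACT-style chunking: the model predicts a 50-step chunk,
    but only the first 15 actions are executed before re-planning."""

    def __init__(self, predict_chunk, n_action_steps=15):
        self.predict_chunk = predict_chunk    # observation -> sequence of 50 actions
        self.n_action_steps = n_action_steps
        self.queue = deque()

    def select_action(self, observation):
        if not self.queue:
            chunk = self.predict_chunk(observation)           # predict the full chunk
            self.queue.extend(chunk[: self.n_action_steps])   # keep only the first 15
        return self.queue.popleft()                           # one action per call
```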

## Limitations

- Trained on a specific robot configuration (SO101)
- Requires the exact camera setup described above
- Performance may vary with different lighting conditions
- Limited to the task domain covered in the training dataset

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{r2owb0_act1,
  author = {Robert},
  title = {ACT Model for SO101 Robot},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/r2owb0/act1}
}
```

## License

This model is licensed under the Apache 2.0 License.