---
license: mit
datasets:
- cadene/droid_1.0.1
language:
- en
base_model:
- stabilityai/stable-video-diffusion-img2vid
pipeline_tag: robotics
tags:
- action_conditioned_video_model
---
<div align="center">
<h2>👉 Ctrl-World: A Controllable Generative World Model for Robot Manipulation</h2>

[Yanjiang Guo*](https://robert-gyj.github.io), [Lucy Xiaoyang Shi*](https://lucys0.github.io),  [Jianyu Chen](http://people.iiis.tsinghua.edu.cn/~jychen/), [Chelsea Finn](https://ai.stanford.edu/~cbfinn/)

 \*Equal contribution; Stanford University, Tsinghua University


<a href='https://arxiv.org/abs/2510.10125'><img src='https://img.shields.io/badge/ArXiv-2510.10125-red'></a> 
<a href='https://ctrl-world.github.io/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a> 

</div>

## TL;DR
[**Ctrl-World**](https://sites.google.com/view/ctrl-world) is an action-conditioned world model compatible with modern VLA policies. It enables policy-in-the-loop rollouts entirely in imagination, which can be used to evaluate and improve the **instruction-following** ability of a VLA policy.

<p>
    <img src="ctrl_world.jpg" alt="Ctrl-World overview" width="100%" />
</p>
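
As a rough sketch of what a policy-in-the-loop rollout in imagination looks like, the Python below alternates between a policy proposing actions and the world model imagining their outcome. `WorldModel`, `Policy`, and `imagined_rollout` are illustrative placeholders, not the actual Ctrl-World API; see the GitHub repo for the real interface.

```
# Illustrative sketch of a policy-in-the-loop rollout in imagination.
# `WorldModel` and `Policy` are hypothetical stand-ins for the actual
# Ctrl-World and VLA interfaces; see the GitHub repo for the real API.

class WorldModel:
    """Action-conditioned video model: predicts future frames given actions."""
    def predict(self, frames, actions):
        # In Ctrl-World this would run the video-diffusion backbone;
        # here we just repeat the last frame as a placeholder.
        return frames[-1:]

class Policy:
    """VLA policy: maps observed frames + instruction to an action chunk."""
    def act(self, frames, instruction):
        return [[0.0] * 7]  # placeholder 7-DoF action chunk

def imagined_rollout(world_model, policy, init_frames, instruction, horizon=10):
    """Roll the policy forward entirely inside the world model."""
    frames = list(init_frames)
    for _ in range(horizon):
        actions = policy.act(frames, instruction)       # policy proposes actions
        frames += world_model.predict(frames, actions)  # model imagines the result
    return frames

# Usage: produce an imagined trajectory for a given instruction.
video = imagined_rollout(WorldModel(), Policy(),
                         init_frames=["frame0"],
                         instruction="pick up the red block")
```

The imagined trajectory can then be scored, e.g. by checking whether the instruction was followed, without touching the real robot.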

## Model Details
This repo includes the Ctrl-World model checkpoint trained on the open-source [**DROID dataset**](https://droid-dataset.github.io/) (~95k trajectories, 564 scenes).
The DROID platform consists of a Franka Panda robotic arm equipped with a Robotiq gripper and three cameras: two randomly placed third-person cameras and one wrist-mounted camera.
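
To fetch the checkpoint files locally, here is a minimal sketch using `huggingface_hub`; the repo id below is a placeholder, so substitute the id shown at the top of this model page:

```
# Minimal checkpoint download sketch using huggingface_hub.
# The repo id is a placeholder -- substitute this model page's id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<this-model-repo-id>")
print(f"Checkpoint files downloaded to: {local_dir}")
```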

## Usage
See the official [**Ctrl-World GitHub repo**](https://github.com/Robert-gyj/Ctrl-World/tree/main) for detailed usage instructions.

## Acknowledgement

Ctrl-World is built on the open-source video foundation model [Stable Video Diffusion](https://github.com/Stability-AI/generative-models). The VLA policy used in this repo is from [openpi](https://github.com/Physical-Intelligence/openpi). We thank the authors for their efforts!


## BibTeX
If you find our work helpful, please leave us a star and cite our paper. Thank you!
```
@article{guo2025ctrl,
  title={Ctrl-World: A Controllable Generative World Model for Robot Manipulation},
  author={Guo, Yanjiang and Shi, Lucy Xiaoyang and Chen, Jianyu and Finn, Chelsea},
  journal={arXiv preprint arXiv:2510.10125},
  year={2025}
}
```