---
license: mit
datasets:
  - cadene/droid_1.0.1
language:
  - en
base_model:
  - stabilityai/stable-video-diffusion-img2vid
pipeline_tag: robotics
tags:
  - action_conditioned_video_model
---

# 👉 Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Yanjiang Guo\*, Lucy Xiaoyang Shi\*, Jianyu Chen, Chelsea Finn

\*Equal contribution; Stanford University, Tsinghua University

## TL;DR

Ctrl-World is an action-conditioned world model compatible with modern vision-language-action (VLA) policies. It enables policy-in-the-loop rollouts entirely in imagination, which can be used to evaluate and improve the instruction-following ability of VLA policies.
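Conceptually, a policy-in-the-loop rollout alternates between the policy choosing an action and the world model imagining the resulting observation, with no real robot in the loop. The sketch below illustrates only this loop structure; the function names and toy stand-ins are hypothetical, not the Ctrl-World API.

```python
# Minimal sketch of a policy-in-the-loop rollout "in imagination".
# All names (imagined_rollout, toy_policy, toy_world_model) are illustrative
# placeholders, not part of the Ctrl-World codebase.

def imagined_rollout(world_model, policy, obs, instruction, horizon=10):
    """Roll a policy out inside a world model for `horizon` steps."""
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(obs, instruction)   # VLA policy proposes an action
        obs = world_model(obs, action)      # world model predicts the next observation
        trajectory.append(obs)
    return trajectory

# Toy stand-ins: the "observation" is a scalar and each action adds 1.0 to it.
toy_policy = lambda obs, instruction: 1.0
toy_world_model = lambda obs, action: obs + action

traj = imagined_rollout(toy_world_model, toy_policy, obs=0.0,
                        instruction="pick up the cup")
print(len(traj), traj[-1])  # 11 10.0
```

The imagined trajectory can then be scored (e.g., for instruction following) without executing anything on hardware.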


## Model Details

This repo includes the Ctrl-World model checkpoint trained on the open-source DROID dataset (~95k trajectories, 564 scenes). The DROID platform consists of a Franka Panda robotic arm equipped with a Robotiq gripper and three cameras: two randomly placed third-person cameras and one wrist-mounted camera.

## Usage

See the official Ctrl-World GitHub repo for detailed usage instructions.

## Acknowledgements

Ctrl-World is built on the open-source video foundation model Stable-Video-Diffusion. The VLA model used in this repo is from openpi. We thank the authors for their efforts!

## BibTeX

If you find our work helpful, please leave us a star and cite our paper. Thank you!

```bibtex
@article{guo2025ctrl,
  title={Ctrl-World: A Controllable Generative World Model for Robot Manipulation},
  author={Guo, Yanjiang and Shi, Lucy Xiaoyang and Chen, Jianyu and Finn, Chelsea},
  journal={arXiv preprint arXiv:2510.10125},
  year={2025}
}
```