WoW-world-model committed
Commit e8df522 · verified · 1 Parent(s): 07f1983

Update README.md

Files changed (1)
  1. README.md +90 -3
README.md CHANGED
@@ -1,3 +1,90 @@
- ---
- license: apache-2.0
- ---
 
---
license: mit
language:
- en
library_name: transformers
tags:
- video-generation
- robotics
- embodied-ai
- physical-reasoning
- causal-reasoning
- inverse-dynamics
- wow
- arxiv:2509.22642
datasets:
- WoW-world-model/WoW-1-Benchmark-Samples
pipeline_tag: video-generation
base_model: wan
---

# 🤖 WoW-1-Wan-14B-2M

**WoW-1-Wan-14B** is a 14-billion-parameter generative world model trained on **2 million real-world robot interaction trajectories**. It is designed to imagine, reason, and act in physically consistent environments, powered by SOPHIA-guided refinement and a co-trained **Inverse Dynamics Model**.

This model is part of the [WoW (World-Omniscient World Model)](https://github.com/wow-world-model/wow-world-model) project, introduced in the paper:

> **[WoW: Towards a World omniscient World model Through Embodied Interaction](https://arxiv.org/abs/2509.22642)**
> *Chi et al., 2025 – arXiv:2509.22642*

## 🧠 Key Features

- **14B parameters** trained on **2M robot interaction trajectories**
- Learns **causal physical reasoning** from embodied action
- Generates physically consistent video and robotic action plans
- Uses **SOPHIA**, a vision-language critic, to refine outputs
- Paired with an **Inverse Dynamics Model** to close the imagination-to-action loop

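For reference, a minimal text-to-video sketch using 🤗 Diffusers is shown below. It assumes this checkpoint is exported in the Wan 2.1 Diffusers layout (`WanPipeline` + `AutoencoderKLWan`); the repo id, resolution, and sampling settings are placeholders, and the official inference scripts in the GitHub repo should be preferred.

```python
# Minimal text-to-video sketch. Assumes this checkpoint follows the Wan 2.1
# Diffusers layout (WanPipeline + AutoencoderKLWan); if it ships as raw Wan
# weights instead, use the inference scripts from the GitHub repo.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "WoW-world-model/WoW-1-Wan-14B-2M"  # placeholder repo id

# The Wan VAE is kept in float32, as in the Wan 2.1 Diffusers examples.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Short-style prompt taken from the mixture-caption examples below.
prompt = "The Franka robot, grasp the red bottle on the table"
frames = pipe(
    prompt=prompt,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "franka_grasp.mp4", fps=16)
```
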
## 🧪 Training Data

<!-- - Dataset: [WoW-1-Benchmark-Samples](https://huggingface.co/datasets/WoW-world-model/WoW-1-Benchmark-Samples) -->
- **2M** real-world robot interaction trajectories
- Multimodal scenes including vision, action, and language
- Diverse **mixture captions** for better generalization (see the sketch below)

### 🧠 Mixture Caption Strategy

- **Prompt Lengths**:
  - Short: *"The Franka robot, grasp the red bottle on the table"*
  - Long: *"The scene... open the drawer, take the screwdriver, place it on the table..."*

- **Robot Model Mixing**:
  - Captions reference various robot types
  - Example: *"grasp with the Franka Panda arm"*, *"use end-effector to align"*

- **Action Granularity**:
  - Coarse: *"move to object"*
  - Fine: *"rotate wrist 30° before grasping"*

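To make the strategy concrete, here is a small, hypothetical sketch of how such mixed captions could be sampled per trajectory; the robot list and templates are illustrative only and are not the actual WoW captioning pipeline.

```python
# Hypothetical sketch of the mixture-caption idea: each trajectory is captioned
# by sampling a prompt length, a robot model, and an action granularity.
# Robot names and templates are illustrative, not the actual WoW pipeline.
import random

ROBOTS = ["Franka Panda arm", "UR5e arm", "bimanual humanoid"]

def mixture_caption(task: str, coarse_steps: list[str], fine_steps: list[str]) -> str:
    robot = random.choice(ROBOTS)
    if random.random() < 0.5:
        # Short prompt: a single high-level instruction.
        return f"The {robot}, {task}."
    # Long prompt: scene description plus step-by-step instructions,
    # at either coarse or fine action granularity.
    steps = random.choice([coarse_steps, fine_steps])
    return (f"The scene shows a tabletop workspace with a {robot}. "
            + ", then ".join(steps) + ".")

print(mixture_caption(
    task="grasp the red bottle on the table",
    coarse_steps=["move to the drawer", "open it", "take the screwdriver", "place it on the table"],
    fine_steps=["approach the drawer handle", "pull the drawer open",
                "rotate the wrist 30 degrees", "grasp the screwdriver", "place it on the table"],
))
```
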
## 🔄 Continuous Updates

The training data will be **continuously updated** with:
- More trajectories
- Richer language descriptions
- Finer multimodal annotations

## 🧩 Applications

- Zero-shot video generation in robotics
- Causal reasoning and physics simulation
- Long-horizon manipulation planning
- Forward and inverse control prediction (sketched below)

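Conceptually, forward and inverse prediction close a single loop: the world model imagines a physically consistent rollout for an instruction, and the Inverse Dynamics Model recovers the action connecting each pair of consecutive imagined frames. Below is a conceptual sketch of that loop; the `generate` and `predict` interfaces are hypothetical placeholders, not the released API.

```python
# Conceptual sketch of the imagination-to-action loop described above.
# The world-model and IDM interfaces are hypothetical, for illustration only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Plan:
    frames: np.ndarray   # imagined video, shape (T, H, W, 3)
    actions: np.ndarray  # recovered actions, shape (T - 1, action_dim)

def imagine_then_act(world_model, idm, observation: np.ndarray, instruction: str) -> Plan:
    # 1. Forward prediction: the world model imagines how the scene evolves
    #    under the language instruction (a physically consistent rollout).
    frames = world_model.generate(observation, instruction)

    # 2. Inverse dynamics: recover the action connecting each pair of
    #    consecutive imagined frames, yielding an executable plan.
    actions = np.stack([idm.predict(frames[t], frames[t + 1])
                        for t in range(len(frames) - 1)])

    # 3. In the full WoW system, a SOPHIA-style vision-language critic scores
    #    the rollout and can trigger refinement before execution.
    return Plan(frames=frames, actions=actions)
```
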
## 📄 Citation

```bibtex
@article{chi2025wow,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Chi, Xiaowei and Jia, Peidong and Fan, Chun-Kai and Ju, Xiaozhu and Mi, Weishi and Qin, Zhiyuan and Zhang, Kevin and Tian, Wanxin and Ge, Kuangzhi and Li, Hao and others},
  journal={arXiv preprint arXiv:2509.22642},
  year={2025}
}
```

## 🔗 Resources

- 🧠 Project page: [wow-world-model.github.io](https://wow-world-model.github.io/)
- 💻 GitHub repo: [wow-world-model/wow-world-model](https://github.com/wow-world-model/wow-world-model)
- 📊 Dataset: [WoW-1 Benchmark Samples](https://huggingface.co/datasets/WoW-world-model/WoW-1-Benchmark-Samples)

---