bubbliiiing committed
Commit · 4ffa5b4
Parent(s): b26307a
Update Readme
Browse files:
- README.md +160 -251
- README_en.md +207 -0
README.md
CHANGED

@@ -9,290 +9,199 @@ tags:
- video
- video-generation
---

Removed:

# Wan2.1

- Wan2.1 Text-to-Video
    - [ ] Diffusers integration
    - [ ] ComfyUI integration
- Wan2.1 Image-to-Video
    - [x] Multi-GPU Inference code of the 14B model
    - [x] Checkpoints of the 14B model
    - [x] Gradio demo
    - [ ] Diffusers integration
    - [ ] ComfyUI integration

## Quickstart

#### Installation

Clone the repo:
```
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
```

Install dependencies:
```
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```

#### Model Download

| Models       | Download Link                                                                 | Notes                         |
|--------------|-------------------------------------------------------------------------------|-------------------------------|
| T2V-14B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P |
| I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P |
| I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P |
| T2V-1.3B     | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P |

> 💡 Note: The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.

Download models using 🤗 huggingface-cli:
```
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```

Download models using 🤖 modelscope-cli:
```
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-1.3B --local_dir ./Wan2.1-T2V-1.3B
```
#### Run Text-to-Video Generation

This repository supports two Text-to-Video models (1.3B and 14B) and two resolutions (480P and 720P). The parameters and configurations for these models are as follows:

<table>
<thead>
<tr>
    <th rowspan="2">Task</th>
    <th colspan="2">Resolution</th>
    <th rowspan="2">Model</th>
</tr>
<tr>
    <th>480P</th>
    <th>720P</th>
</tr>
</thead>
<tbody>
<tr>
    <td>t2v-14B</td>
    <td style="color: green;">✔️</td>
    <td style="color: green;">✔️</td>
    <td>Wan2.1-T2V-14B</td>
</tr>
<tr>
    <td>t2v-1.3B</td>
    <td style="color: green;">✔️</td>
    <td style="color: red;">❌</td>
    <td>Wan2.1-T2V-1.3B</td>
</tr>
</tbody>
</table>

> 💡 Note: If you are using the `T2V-1.3B` model, we recommend setting the parameter `--sample_guide_scale 6`. The `--sample_shift` parameter can be adjusted within the range of 8 to 12 based on the performance.

```
torchrun --nproc_per_node=8 generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --dit_fsdp --t5_fsdp --ulysses_size 8 --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```

- Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.
- You can modify the model used for extension with the parameter `--prompt_extend_model`. For example:
```
DASH_API_KEY=your_key python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'ch'
```

```
# if one uses dashscope's API for prompt extension
DASH_API_KEY=your_key python t2v_1.3B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-1.3B
```

We employ our **Wan-Bench** framework to evaluate the performance of the T2V-1.3B model, with the results displayed in the table below. The results indicate that our smaller 1.3B model surpasses the overall metrics of larger open-source models, demonstrating the effectiveness of **Wan2.1**'s architecture and the data construction pipeline.

<div align="center">
    <img src="assets/vben_1.3b_vs_sota.png" alt="" style="width: 80%;" />
</div>

## Computational Efficiency on Different GPUs

We test the computational efficiency of different **Wan2.1** models on different GPUs in the following table. The results are presented in the format: **Total time (s) / peak GPU memory (GB)**.

<div align="center">
    <img src="assets/comp_effic.png" alt="" style="width: 80%;" />
</div>

> The parameter settings for the tests presented in this table are as follows:
> (1) For the 1.3B model on 8 GPUs, set `--ring_size 8` and `--ulysses_size 1`;
> (2) For the 14B model on 1 GPU, use `--offload_model True`;
> (3) For the 1.3B model on a single 4090 GPU, set `--offload_model True --t5_cpu`;
> (4) For all tests, no prompt extension was applied, meaning `--use_prompt_extend` was not enabled.

-------

## Introduction of Wan2.1

**Wan2.1** is designed on the mainstream diffusion transformer paradigm, achieving significant advancements in generative capabilities through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Collectively, these contributions enhance the model's performance and versatility.

##### (1) 3D Variational Autoencoders
We propose a novel 3D causal VAE architecture, termed **Wan-VAE**, specifically designed for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. **Wan-VAE** demonstrates significant advantages in performance efficiency compared to other open-source VAEs. Furthermore, our **Wan-VAE** can encode and decode unlimited-length 1080P videos without losing historical temporal information, making it particularly well-suited for video generation tasks. A toy illustration of the causality constraint follows the figure below.

<div align="center">
    <img src="assets/video_vae_res.jpg" alt="" style="width: 80%;" />
</div>
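
The following is a minimal PyTorch sketch, written for this README and not taken from Wan-VAE's actual code: a causal 3D convolution pads only on the past side of the time axis, so the features for frame t never depend on frames after t.

```python
# Toy sketch of temporal causality (illustrative, not Wan-VAE's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.pad_t = kt - 1  # all temporal padding goes before the first frame
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, kh // 2, kw // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        # F.pad pads the last dims first: (W_l, W_r, H_l, H_r, T_front, T_back)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)

video = torch.randn(1, 3, 8, 64, 64)
print(CausalConv3d(3, 16)(video).shape)  # torch.Size([1, 16, 8, 64, 64])
```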

##### (2) Video Diffusion DiT

**Wan2.1** is designed using the Flow Matching framework within the paradigm of mainstream Diffusion Transformers. Our model's architecture uses the T5 Encoder to encode multilingual text input, with cross-attention in each transformer block embedding the text into the model structure. Additionally, we employ an MLP with a Linear layer and a SiLU layer to process the input time embeddings and predict six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning a distinct set of biases. Our experimental findings reveal a significant performance improvement with this approach at the same parameter scale; a small sketch of this shared-modulation scheme follows the table below.

<div align="center">
    <img src="assets/video_dit_arch.jpg" alt="" style="width: 80%;" />
</div>

| Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
|-------|-----------|-----------------|------------------|-----------------------|---------------------|-----------------|------------------|
| 1.3B  | 1536      | 16              | 16               | 8960                  | 256                 | 12              | 30               |
| 14B   | 5120      | 16              | 16               | 13824                 | 256                 | 40              | 40               |
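
To make the shared-MLP design concrete, here is a small PyTorch sketch reconstructed from the description above. It is illustrative only; the six parameters are named following the common adaLN convention (shift/scale/gate for the attention and feed-forward sublayers), which the text does not spell out.

```python
# Illustrative sketch (not the repo's code): one SiLU+Linear MLP, shared by
# every transformer block, maps the time embedding to six modulation
# parameters; each block adds its own learned bias to specialize.
import torch
import torch.nn as nn

dim, num_blocks = 1536, 30  # the 1.3B configuration from the table above

shared_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))  # shared weights
block_bias = nn.Parameter(torch.zeros(num_blocks, 6 * dim))     # per-block bias

t_emb = torch.randn(2, dim)  # (batch, dim) time embedding
base = shared_mlp(t_emb)     # computed once, reused by all blocks
for i in range(num_blocks):
    # Six parameters; in adaLN-style designs these are typically the
    # shift/scale/gate pairs for attention and the feed-forward network.
    shift_a, scale_a, gate_a, shift_f, scale_f, gate_f = \
        (base + block_bias[i]).chunk(6, dim=-1)
```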

##### Data

We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During the data curation process, we designed a four-step data cleaning process, focusing on fundamental dimensions, visual quality, and motion quality. Through the robust data processing pipeline, we can easily obtain high-quality, diverse, and large-scale training sets of images and videos.

![image](assets/data_for_diff_stage.jpg "image")
##### Comparisons to SOTA
We compared **Wan2.1** with leading open-source and closed-source models to evaluate the performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then calculated the total score through a weighted average based on the importance of each dimension. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared to both open-source and closed-source models.

![comparison](assets/vben_vs_sota.png "comparison")

```
@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {},
    year    = {2025}
}
```

## Contact Us
If you would like to leave a message to our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups](https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!

Added:

# Wan-Fun

😊 Welcome!

[🤗 Hugging Face Space](https://huggingface.co/spaces/alibaba-pai/Wan-Fun-1.3b)

[English](./README_en.md) | [简体中文](./README.md)

# Table of Contents
- [Table of Contents](#table-of-contents)
- [Model zoo](#model-zoo)
- [Video Result](#video-result)
- [Quick Start](#quick-start)
- [How to use](#how-to-use)
- [Reference](#reference)
- [License](#license)

# Model zoo
V1.0:
| Name | Storage Space | Hugging Face | Model Scope | Description |
|--|--|--|--|--|
| Wan2.1-Fun-1.3B-InP | 13.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-1.3B-InP) | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. |
| Wan2.1-Fun-14B-InP | 20.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP) | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. |

# Video Result

### Wan2.1-Fun-14B-InP

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/4e10d491-f1cf-4b08-a7c5-1e01e5418140" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/bd72a276-e60e-4b5d-86c1-d0f67e7425b9" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/cb7aef09-52c2-4973-80b4-b2fb63425044" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/f7e363a9-be09-4b72-bccf-cce9c9ebeb9b" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

### Wan2.1-Fun-1.3B-InP

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/28f3e720-8acc-4f22-a5d0-ec1c571e9466" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/fb6e4cb9-270d-47cd-8501-caf8f3e91b5c" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/989a4644-e33b-4f0c-b68e-2ff6ba37ac7e" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/9c604fa7-8657-49d1-8066-b5bb198b28b6" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

# Quick Start
### 1. Cloud usage: AliyunDSW/Docker
#### a. From AliyunDSW
DSW has free GPU time, which a user can apply for once; it is valid for 3 months after applying.

Aliyun provides free GPU time in [Freetier](https://free.aliyun.com/?product=9602825&crowd=enterprise&spm=5176.28055625.J_5831864660.1.e939154aRgha4e&scm=20140722.M_9974135.P_110.MO_1806-ID_9974135-MID_9974135-CID_30683-ST_8512-V_1); get it and use it in Aliyun PAI-DSW to start CogVideoX-Fun within 5 minutes!

[DSW Notebook Gallery](https://gallery.pai-ml.com/#/preview/deepLearning/cv/cogvideox_fun)

#### b. From ComfyUI
Our ComfyUI interface is shown below; see the [ComfyUI README](comfyui/README.md) for details.
![workflow image]()

#### c. From docker
If you are using docker, please make sure that the graphics card driver and the CUDA environment are correctly installed on your machine, then execute the following commands:
```
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git

# enter CogVideoX-Fun's dir
cd CogVideoX-Fun

# download weights
mkdir models/Diffusion_Transformer
mkdir models/Personalized_Model

# Please use the huggingface link or modelscope link to download the model.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP

# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP
```
### 2. Local install: Environment Check/Downloading/Installation
#### a. Environment Check
We have verified that this repo runs in the following environments:

Details for Windows:
- OS: Windows 10
- python: python3.10 & python3.11
- pytorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-3060 12G & Nvidia-3090 24G

Details for Linux:
- OS: Ubuntu 20.04, CentOS
- python: python3.10 & python3.11
- pytorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-V100 16G & Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G

We need about 60GB of available disk space; please check!

#### b. Weights
We recommend placing the [weights](#model-zoo) along the specified paths; a small download sketch follows the tree below:
```
📦 models/
├── 📂 Diffusion_Transformer/
│   ├── 📂 CogVideoX-Fun-V1.1-2b-InP/
│   ├── 📂 CogVideoX-Fun-V1.1-5b-InP/
│   ├── 📂 Wan2.1-Fun-14B-InP/
│   └── 📂 Wan2.1-Fun-1.3B-InP/
├── 📂 Personalized_Model/
│   └── your trained transformer model / your trained lora model (for UI load)
```
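
Besides the manual links in the docker section above, the weights can also be fetched programmatically. A minimal sketch, assuming `huggingface_hub` is installed (`pip install huggingface_hub`); swap `repo_id` for the 14B model as needed:

```python
# Sketch: download weights into the layout shown above via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-1.3B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-1.3B-InP",
)
```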

# How to Use

<h3 id="video-gen">1. Generation</h3>

#### a. GPU Memory Optimization
Since Wan2.1 has a very large number of parameters, we need memory-saving strategies to fit consumer-grade GPUs. Each prediction file provides `GPU_memory_mode`, which can be set to `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, or `sequential_cpu_offload`. The same scheme also applies to CogVideoX-Fun generation.
- `model_cpu_offload`: the whole model is moved to the CPU after use, saving some GPU memory.
- `model_cpu_offload_and_qfloat8`: the whole model is moved to the CPU after use, and the transformer is quantized to float8, saving more GPU memory.
- `sequential_cpu_offload`: each layer of the model is moved to the CPU after use; it is slower but saves a large amount of GPU memory.

`qfloat8` partially degrades model performance but saves more GPU memory. If you have enough GPU memory, we recommend `model_cpu_offload`. A small usage sketch follows.
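
As a rough illustration of what the first and last modes correspond to, here is a diffusers-style sketch. The pipeline class and checkpoint path are placeholders, since the repo's own prediction scripts read `GPU_memory_mode` themselves:

```python
# Rough illustration using diffusers' generic offload hooks (placeholders,
# not this repo's loader).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "models/Diffusion_Transformer/Wan2.1-Fun-1.3B-InP",  # placeholder path
    torch_dtype=torch.bfloat16,
)

# model_cpu_offload: whole sub-models hop to the GPU only while they run.
pipe.enable_model_cpu_offload()

# sequential_cpu_offload: weights move layer by layer; slowest, least memory.
# pipe.enable_sequential_cpu_offload()
```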

#### b. Using ComfyUI
For details, see the [ComfyUI README](comfyui/README.md).

#### c. Running Python Files
- Step 1: Download the corresponding [weights](#model-zoo) and place them in the `models` folder.
- Step 2: Use different files for prediction depending on the weights and the prediction goal. This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun; the models are distinguished by folder names under the `examples` folder, and their supported features differ, so check each case accordingly. CogVideoX-Fun is used as the example below (a small parameter sketch follows this list):
  - Text-to-video:
    - Modify `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file examples/cogvideox_fun/predict_t2v.py.
    - Then run examples/cogvideox_fun/predict_t2v.py and wait for the result, which is saved in the samples/cogvideox-fun-videos folder.
  - Image-to-video:
    - Modify `validation_image_start`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file examples/cogvideox_fun/predict_i2v.py.
    - `validation_image_start` is the starting image of the video, and `validation_image_end` is the ending image of the video.
    - Then run examples/cogvideox_fun/predict_i2v.py and wait for the result, which is saved in the samples/cogvideox-fun-videos_i2v folder.
  - Video-to-video:
    - Modify `validation_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file examples/cogvideox_fun/predict_v2v.py.
    - `validation_video` is the reference video for video-to-video generation. You can run a demo with the following video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1/play_guitar.mp4)
    - Then run examples/cogvideox_fun/predict_v2v.py and wait for the result, which is saved in the samples/cogvideox-fun-videos_v2v folder.
  - Controlled video generation (Canny, Pose, Depth, etc.):
    - Modify `control_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file examples/cogvideox_fun/predict_v2v_control.py.
    - `control_video` is the control video, extracted with operators such as Canny, Pose, or Depth. You can run a demo with the following video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1.1/pose.mp4)
    - Then run examples/cogvideox_fun/predict_v2v_control.py and wait for the result, which is saved in the samples/cogvideox-fun-videos_v2v_control folder.
- Step 3: If you want to combine other backbones or Loras you trained yourself, modify `lora_path` and the relevant paths in examples/{model_name}/predict_t2v.py or examples/{model_name}/predict_i2v.py as needed.
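
As a reference for Step 2, the knobs listed above might look like this near the top of predict_t2v.py; the exact variable layout in the real file may differ:

```python
# Hypothetical values for the parameters named above (illustrative only).
prompt         = "A panda playing a guitar in a bamboo forest."
neg_prompt     = "blurry, low quality, watermark"
guidance_scale = 6.0  # higher values follow the prompt more strictly
seed           = 43   # fix the seed for reproducible sampling
```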

#### d. Using the Web UI

The web UI supports text-to-video, image-to-video, video-to-video, and controlled video generation (Canny, Pose, Depth, etc.). This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun; the models are distinguished by folder names under the `examples` folder, and their supported features differ, so check each case accordingly. CogVideoX-Fun is used as the example below:

- Step 1: Download the corresponding [weights](#model-zoo) and place them in the `models` folder.
- Step 2: Run examples/cogvideox_fun/app.py to enter the Gradio page.
- Step 3: Select the generation model on the page, fill in `prompt`, `neg_prompt`, `guidance_scale`, `seed`, and so on, click Generate, and wait for the result, which is saved in the `sample` folder.

# Reference
- CogVideo: https://github.com/THUDM/CogVideo/
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate
- Wan2.1: https://github.com/Wan-Video/Wan2.1/

# License
This project is licensed under the [Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE).
README_en.md
ADDED

@@ -0,0 +1,207 @@

---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-to-video
library_name: diffusers
tags:
- video
- video-generation
---

# Wan-Fun

😊 Welcome!

[🤗 Hugging Face Space](https://huggingface.co/spaces/alibaba-pai/Wan-Fun-1.3b)

[English](./README_en.md) | [简体中文](./README.md)

# Table of Contents
- [Table of Contents](#table-of-contents)
- [Model zoo](#model-zoo)
- [Video Result](#video-result)
- [Quick Start](#quick-start)
- [How to use](#how-to-use)
- [Reference](#reference)
- [License](#license)

# Model zoo
V1.0:
| Name | Storage Space | Hugging Face | Model Scope | Description |
|--|--|--|--|--|
| Wan2.1-Fun-1.3B-InP | 13.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-1.3B-InP) | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. |
| Wan2.1-Fun-14B-InP | 20.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP) | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. |

# Video Result

### Wan2.1-Fun-14B-InP

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/4e10d491-f1cf-4b08-a7c5-1e01e5418140" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/bd72a276-e60e-4b5d-86c1-d0f67e7425b9" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/cb7aef09-52c2-4973-80b4-b2fb63425044" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/f7e363a9-be09-4b72-bccf-cce9c9ebeb9b" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

### Wan2.1-Fun-1.3B-InP

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/28f3e720-8acc-4f22-a5d0-ec1c571e9466" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/fb6e4cb9-270d-47cd-8501-caf8f3e91b5c" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/989a4644-e33b-4f0c-b68e-2ff6ba37ac7e" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/9c604fa7-8657-49d1-8066-b5bb198b28b6" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

# Quick Start
### 1. Cloud usage: AliyunDSW/Docker
#### a. From AliyunDSW
DSW has free GPU time, which a user can apply for once; it is valid for 3 months after applying.

Aliyun provides free GPU time in [Freetier](https://free.aliyun.com/?product=9602825&crowd=enterprise&spm=5176.28055625.J_5831864660.1.e939154aRgha4e&scm=20140722.M_9974135.P_110.MO_1806-ID_9974135-MID_9974135-CID_30683-ST_8512-V_1); get it and use it in Aliyun PAI-DSW to start CogVideoX-Fun within 5 minutes!

[DSW Notebook Gallery](https://gallery.pai-ml.com/#/preview/deepLearning/cv/cogvideox_fun)

#### b. From ComfyUI
Our ComfyUI interface is shown below; please refer to [ComfyUI README](comfyui/README.md) for details.
![workflow image]()

#### c. From docker
If you are using docker, please make sure that the graphics card driver and CUDA environment have been installed correctly on your machine.

Then execute the following commands:
```
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git

# enter CogVideoX-Fun's dir
cd CogVideoX-Fun

# download weights
mkdir models/Diffusion_Transformer
mkdir models/Personalized_Model

# Please use the huggingface link or modelscope link to download the model.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP

# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP
```

### 2. Local install: Environment Check/Downloading/Installation
#### a. Environment Check
We have verified that this repo runs in the following environments:

The details of Windows:
- OS: Windows 10
- python: python3.10 & python3.11
- pytorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-3060 12G & Nvidia-3090 24G

The details of Linux:
- OS: Ubuntu 20.04, CentOS
- python: python3.10 & python3.11
- pytorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-V100 16G & Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G

We need about 60GB of disk space available (for saving weights); please check!

#### b. Weights
We recommend placing the [weights](#model-zoo) along the specified paths; a small layout check follows the tree below:
```
📦 models/
├── 📂 Diffusion_Transformer/
│   ├── 📂 CogVideoX-Fun-V1.1-2b-InP/
│   ├── 📂 CogVideoX-Fun-V1.1-5b-InP/
│   ├── 📂 Wan2.1-Fun-14B-InP/
│   └── 📂 Wan2.1-Fun-1.3B-InP/
├── 📂 Personalized_Model/
│   └── your trained transformer model / your trained lora model (for UI load)
```

# How to Use

<h3 id="video-gen">1. Generation</h3>

#### a. GPU Memory Optimization
Since Wan2.1 has a very large number of parameters, we need to consider memory optimization strategies to adapt to consumer-grade GPUs. We provide `GPU_memory_mode` for each prediction file, allowing you to choose between `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `sequential_cpu_offload`. This solution is also applicable to CogVideoX-Fun generation.

- `model_cpu_offload`: The entire model is moved to the CPU after use, saving some GPU memory.
- `model_cpu_offload_and_qfloat8`: The entire model is moved to the CPU after use, and the transformer model is quantized to float8, saving more GPU memory.
- `sequential_cpu_offload`: Each layer of the model is moved to the CPU after use. It is slower but saves a significant amount of GPU memory.

`qfloat8` may slightly reduce model performance but saves more GPU memory. If you have sufficient GPU memory, it is recommended to use `model_cpu_offload`. A small quantization sketch follows below.
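
The README does not say which float8 implementation backs `model_cpu_offload_and_qfloat8`. As one common way to apply float8 weight quantization to a transformer module, here is a hedged sketch using `optimum-quanto` (an assumption, not necessarily what this repo does internally):

```python
# Sketch: float8 weight quantization with optimum-quanto (assumed backend,
# shown for illustration only).
import torch.nn as nn
from optimum.quanto import freeze, qfloat8, quantize

transformer = nn.Linear(16, 16)         # stand-in for the loaded video DiT
quantize(transformer, weights=qfloat8)  # swap weights for qfloat8 versions
freeze(transformer)                     # materialize the quantized weights
```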

#### b. Using ComfyUI
For details, refer to [ComfyUI README](comfyui/README.md).

#### c. Running Python Files
- **Step 1**: Download the corresponding [weights](#model-zoo) and place them in the `models` folder.
- **Step 2**: Use different files for prediction based on the weights and prediction goals. This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun. Different models are distinguished by folder names under the `examples` folder, and their supported features vary. Use them accordingly. Below is an example using CogVideoX-Fun:
  - **Text-to-Video**:
    - Modify `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_t2v.py`.
    - Run the file `examples/cogvideox_fun/predict_t2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos`.
  - **Image-to-Video**:
    - Modify `validation_image_start`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_i2v.py`.
    - `validation_image_start` is the starting image of the video, and `validation_image_end` is the ending image of the video.
    - Run the file `examples/cogvideox_fun/predict_i2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_i2v`.
  - **Video-to-Video**:
    - Modify `validation_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_v2v.py`.
    - `validation_video` is the reference video for video-to-video generation. You can use the following demo video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1/play_guitar.mp4).
    - Run the file `examples/cogvideox_fun/predict_v2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_v2v`.
  - **Controlled Video Generation (Canny, Pose, Depth, etc.)**:
    - Modify `control_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_v2v_control.py`.
    - `control_video` is the control video extracted using operators such as Canny, Pose, or Depth. You can use the following demo video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1.1/pose.mp4).
    - Run the file `examples/cogvideox_fun/predict_v2v_control.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_v2v_control`.
- **Step 3**: If you want to integrate other backbones or Loras trained by yourself, modify `lora_path` and relevant paths in `examples/{model_name}/predict_t2v.py` or `examples/{model_name}/predict_i2v.py` as needed.

#### d. Using the Web UI
The web UI supports text-to-video, image-to-video, video-to-video, and controlled video generation (Canny, Pose, Depth, etc.). This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun. Different models are distinguished by folder names under the `examples` folder, and their supported features vary. Use them accordingly. Below is an example using CogVideoX-Fun:

- **Step 1**: Download the corresponding [weights](#model-zoo) and place them in the `models` folder.
- **Step 2**: Run the file `examples/cogvideox_fun/app.py` to access the Gradio interface.
- **Step 3**: Select the generation model on the page, fill in `prompt`, `neg_prompt`, `guidance_scale`, and `seed`, click "Generate," and wait for the results. The generated videos will be saved in the `sample` folder.

# Reference
- CogVideo: https://github.com/THUDM/CogVideo/
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate
- Wan2.1: https://github.com/Wan-Video/Wan2.1/

# License
This project is licensed under the [Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE).