bubbliiiing committed on
Commit
5155fc5
·
1 Parent(s): 3a7324b

Update 2602 for gray control

README.md CHANGED
@@ -8,40 +8,46 @@ library_name: videox_fun
8
  [![Github](https://img.shields.io/badge/🎬%20Code-VideoX_Fun-blue)](https://github.com/aigc-apps/VideoX-Fun)
9
 
10
  ## Update
11
- - A new lite model has been added with Control Latents applied on 5 layers (only 1.9GB). The previous Control model had two issues: insufficient mask randomness causing the model to learn mask patterns and auto-fill during inpainting, and overfitting between control and tile distillation causing artifacts at large control_context_scale values. Both Control and Tile models have been retrained with enriched mask varieties and improved training schedules. Additionally, the dataset has been restructured with multi-resolution control images (512~1536) instead of single resolution (512) for better robustness. [2026.01.12]
12
- - During testing, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and become blurry. We performed 8-step distillation on the version 2.1 model, and the distilled model demonstrates better performance when using 8-step prediction. Additionally, we have uploaded a tile model that can be used for super-resolution generation. [2025.12.22]
13
- - Due to a typo in version 2.0, `control_layers` was used instead of `control_noise_refiner` to process refiner latents during training. Although the model converged normally, the model inference speed was slow because `control_layers` forward pass was performed twice. In version 2.1, we made an urgent fix and the speed has returned to normal. [2025.12.17]
 
14
 
15
  ## Model Card
16
 
17
- ### a. 2601 Models
18
  | Name | Description |
19
  |--|--|
20
- | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors | Compared to the old version of the model, a more diverse variety of masks and a more reasonable training schedule have been adopted. This reduces bright spots/artifacts and mask information leakage. Additionally, the dataset has been restructured with multi-resolution control images (512~1536) instead of single resolution (512) for better robustness. |
21
- | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors | Compared to the old version of the model, a higher resolution was used for training, and a more reasonable training schedule was employed during distillation, which reduces bright spots/artifacts. |
22
- | Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors | Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger control_context_scale values, and the generation results appear more natural. It is also suitable for lower-spec machines. |
23
- | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors | Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger control_context_scale values, and the generation results appear more natural. It is also suitable for lower-spec machines. |
24
 
25
- ### b. Models Before 2601
26
  | Name | Description |
27
  |--|--|
28
- | Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Based on version 2.1, the model was distilled using an 8-step distillation algorithm. 8-step prediction is recommended. Compared to version 2.1, when using 8-step prediction, the images are clearer and the composition is more reasonable. |
29
- | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors | A Tile model trained on high-definition datasets that can be used for super-resolution, with a maximum training resolution of 2048x2048. The model was distilled using an 8-step distillation algorithm, and 8-step prediction is recommended. |
30
- | Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors | A retrained model after fixing the typo in version 2.0, with faster single-step speed. Similar to version 2.0, the model lost some of its acceleration capability after training, thus requiring more steps. |
31
- | Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, it adds modifications to more layers and was trained for a longer time. However, due to a typo in the code, the layer blocks were forwarded twice, resulting in slower speed. The model supports multiple control conditions such as Canny, Depth, Pose, MLSD, etc. Additionally, the model lost some of its acceleration capability after training, thus requiring more steps. |
32
 
33
  ## Model Features
34
- - This ControlNet is added on 15 layer blocks and 2 refiner layer blocks (Lite models are added on 3 layer blocks and 2 refiner blocks). It supports multiple control conditionsincluding Canny, HED, Depth, Pose and MLSD can be used like a standard ControlNet.
35
- - Inpainting mode is also supported. When using inpaint mode, please use a larger control_context_scale, as this will result in better image continuity.
36
- - Training Process:
37
- - 2.0: The model was trained from scratch for 70,000 steps on a dataset of 1 million high-quality images covering both general and human-centric content. Training was performed at 1328 resolution using BFloat16 precision, with a batch size of 64, a learning rate of 2e-5, and a text dropout ratio of 0.10.
38
- - 2.1: Version 2.1 is based on the version 2.0 weights and continued training for an additional 11,000 steps after the typo fix, using the same parameters and dataset.
39
- - 2.1-8-steps: Version 2.1-8-steps was obtained by training for 5,500 steps using an 8-step distillation algorithm based on version 2.1.
40
- - Note on Steps:
41
- - 2.0 and 2.1: As you increase the control strength (higher control_context_scale values), it's recommended to appropriately increase the number of inference steps to achieve better results and maintain generation quality. This is likely because the control model has not been distilled.
42
- - 2.1-8-steps: Just use 8 steps in inference.
43
- - You can adjust control_context_scale for stronger control and better detail preservation. For better stability, we highly recommend using a detailed prompt. The optimal range for control_context_scale is from 0.65 to 1.00.
44
- - During testing, in versions 2.0 and 2.1, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and produce blurry images. For detailed information on strength and step count testing, please refer to Scale Test Results. These results were generated using version 2.0. For strength and step testing, please refer to [Scale Test Results](#scale-test-results). This was obtained by generating with version 2.0.
45
 
46
  ## Results
47
  ### a. Difference between 2.1-8steps and 2.1-2601-8steps.
@@ -88,7 +94,7 @@ The old 8-steps model sometimes learned the mask information and tended to compl
88
 
89
  ### c. Generation Results With 2.1-lite-2601-8steps
90
 
91
- Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger control_context_scale values, and the generation results appear more natural. It is also suitable for lower-spec machines.
92
 
93
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
94
  <tr>
@@ -234,6 +240,19 @@ Uses the same training scheme as the 2601 version, but compared to the large ver
234
  </tr>
235
  </table>
236
 
237
  ## Inference
238
  Go to the VideoX-Fun repository for more details.
239
 
 
8
  [![Github](https://img.shields.io/badge/🎬%20Code-VideoX_Fun-blue)](https://github.com/aigc-apps/VideoX-Fun)
9
 
10
  ## Update
11
+ - **[2026.02.26]** Update to version 2602, with support for Gray Control.
12
+ - **[2026.01.12]** Update to version 2601, with support for Scribble Control. Added lite models (1.9GB, 5 layers). Retrained Control and Tile models with enriched mask varieties, improved training schedules, and multi-resolution control images (512~1536) to fix mask pattern leakage and large `control_context_scale` artifacts.
13
+ - **[2025.12.22]** Performed 8-step distillation on v2.1 to restore acceleration lost when applying ControlNet. Uploaded a tile model for super-resolution.
14
+ - **[2025.12.17]** Fixed v2.0 typo (`control_layers` used instead of `control_noise_refiner`), which caused double forward pass and slow inference. Speed restored in v2.1.
15
 
16
  ## Model Card
17
+ ### a. 2602 Models
18
+ | Name | Description |
19
+ |--|--|
20
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2602-8steps.safetensors | Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, Scribble, and Gray). |
21
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2602-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, Scribble, and Gray). |
22
 
23
+ ### b. 2601 Models
24
  | Name | Description |
25
  |--|--|
26
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors | Compared to the old version, this model uses more diverse masks, a more reasonable training schedule, and multi-resolution control images (512–1536) instead of single resolution (512). This reduces artifacts and mask information leakage while improving robustness. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, and Scribble). |
27
+ | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors | Compared to the old version, uses higher training resolution and a more refined distillation schedule, reducing bright spots and artifacts. |
28
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows for larger control_context_scale values with more natural results, and is also better suited for lower-spec machines. Supports multiple control conditions (Canny, Depth, Pose, MLSD, Hed, and Scribble). |
29
+ | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. Allows larger control_context_scale values with more natural results, and better suits lower-spec machines. |
30
 
31
+ ### c. Models Before 2601
32
  | Name | Description |
33
  |--|--|
34
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Distilled from version 2.1 using an 8-step distillation algorithm. Compared to version 2.1, 8-step prediction yields clearer images with more reasonable composition. Supports Canny, Depth, Pose, MLSD, and Hed. |
35
+ | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors | A Tile model trained on high-definition datasets (up to 2048×2048) for super-resolution, distilled using an 8-step algorithm. 8-step prediction is recommended. |
36
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors | A retrained model fixing the typo in version 2.0, with faster single-step speed. Supports Canny, Depth, Pose, MLSD, and Hed. However, like version 2.0, some acceleration capability was lost during training, requiring more steps and cfg. |
37
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, more layers are modified with longer training. However, a code typo caused layer blocks to forward twice, resulting in slower speed. Supports Canny, Depth, Pose, MLSD, and Hed. Some acceleration capability was lost during training, requiring more steps. |
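
The condition lists in the three Model Card tables can be summarized as a small lookup. This is only an illustrative sketch: `SUPPORTED_CONDITIONS`, `release_of`, and `supports` are hypothetical names that are not part of VideoX-Fun, and the release tag is guessed from the file name pattern used in the tables. It covers the Union checkpoints only, since the Tile rows do not list control conditions.

```python
# Control conditions per release, transcribed from the Model Card tables.
# All names in this block are hypothetical helpers, not a VideoX-Fun API.
BASE_CONDITIONS = ("canny", "depth", "pose", "mlsd", "hed")

SUPPORTED_CONDITIONS = {
    "pre-2601": BASE_CONDITIONS,
    "2601": BASE_CONDITIONS + ("scribble",),
    "2602": BASE_CONDITIONS + ("scribble", "gray"),
}


def release_of(checkpoint_name: str) -> str:
    """Guess the release tag from a Union checkpoint file name."""
    for tag in ("2602", "2601"):
        if f"-{tag}-" in checkpoint_name:
            return tag
    return "pre-2601"


def supports(checkpoint_name: str, condition: str) -> bool:
    """Return True if the tables list `condition` for this checkpoint's release."""
    return condition.lower() in SUPPORTED_CONDITIONS[release_of(checkpoint_name)]
```

For example, `supports("Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors", "gray")` is false under this mapping, because Gray is only listed for the 2602 models.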
38
 
39
  ## Model Features
40
+ - This ControlNet is applied to 15 layer blocks and 2 refiner layer blocks (Lite models: 3 layer blocks and 2 refiner layer blocks). It supports multiple control conditions including Canny, HED, Depth, Pose, and MLSD. Scribble is additionally supported by the 2601 and 2602 models, and Gray by the 2602 models.
41
+ - Inpainting mode is also supported. For inpaint mode, use a larger `control_context_scale` for better image continuity.
42
+ - **Training Process:**
43
+ - **2.0:** Trained from scratch for 70,000 steps on 1M high-quality images (general and human-centric content) at 1328 resolution with BFloat16 precision, batch size 64, learning rate 2e-5, and text dropout ratio 0.10.
44
+ - **2.1:** Continued training from 2.0 weights for 11,000 additional steps after fixing a typo, using the same parameters and dataset.
45
+ - **2.1-8-steps:** Distilled from version 2.1 using an 8-step distillation algorithm for 5,500 steps.
46
+ - **Note on Steps:**
47
+ - **2.0 and 2.1:** Higher `control_context_scale` values may require more inference steps for better results, likely because the control model has not been distilled.
48
+ - **2.1-8-steps:** Use 8 steps for inference.
49
+ - Adjust `control_context_scale` (optimal range: 0.65–1.00) for stronger control and better detail preservation. A detailed prompt is highly recommended for stability.
50
+ - In versions 2.0 and 2.1, applying ControlNet to Z-Image-Turbo caused loss of acceleration capability and blurry images. For strength and step count testing details, refer to [Scale Test Results](#scale-test-results) (generated with version 2.0).
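
The step and scale guidance in the Model Features list can be sketched as a small helper. This is a hedged illustration, not the repository's API: `ControlSettings` and `suggest_settings` are hypothetical names, and detecting a distilled checkpoint by the `8steps` substring in its file name is an assumption based on the naming in the Model Card. The README gives no concrete step count for the non-distilled 2.0 and 2.1 models, so the helper leaves that value unset rather than guessing one.

```python
from dataclasses import dataclass
from typing import Optional

# Optimal control_context_scale range quoted in the README.
SCALE_MIN, SCALE_MAX = 0.65, 1.00


@dataclass
class ControlSettings:
    control_context_scale: float
    num_inference_steps: Optional[int]  # None: the README fixes no value
    in_optimal_range: bool


def suggest_settings(checkpoint_name: str, control_context_scale: float) -> ControlSettings:
    """Hypothetical helper mirroring the README's 'Note on Steps'."""
    # Assumption: distilled checkpoints carry "8steps" in their file name.
    distilled = "8steps" in checkpoint_name
    steps = 8 if distilled else None
    in_range = SCALE_MIN <= control_context_scale <= SCALE_MAX
    return ControlSettings(control_context_scale, steps, in_range)
```

A scale outside the range is only flagged, not clamped, because the README also suggests larger values for inpainting and for the lite models.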
51
 
52
  ## Results
53
  ### a. Difference between 2.1-8steps and 2.1-2601-8steps.
 
94
 
95
  ### c. Generation Results With 2.1-lite-2601-8steps
96
 
97
+ The lite model shares the same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows for larger `control_context_scale` values with more natural results, and is also better suited for lower-spec machines.
98
 
99
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
100
  <tr>
 
240
  </tr>
241
  </table>
242
 
243
+ ### e. Gray Control Results with 2602 Models
244
+
245
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
246
+ <tr>
247
+ <td>Low Resolution</td>
248
+ <td>High Resolution</td>
249
+ </tr>
250
+ <tr>
251
+ <td><img src="asset/gray.jpg" width="100%" /></td>
252
+ <td><img src="results/gray.png" width="100%" /></td>
253
+ </tr>
254
+ </table>
255
+
256
  ## Inference
257
  Go to the VideoX-Fun repository for more details.
258
 
Z-Image-Turbo-Fun-Controlnet-Union-2.1-2602-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d1251cc7bc3486bc61d25c3be498ef394c31c85ddf4ee9137d2e933411f4a689
3
+ size 6712485600
Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2602-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3ea098db9bd145be525c7e2366920b6d76c5ffd46b3d7aa8169bbc943fdaee35
3
+ size 2016627488
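
The two added `.safetensors` entries are Git LFS pointer files: three `key value` lines giving the spec version, the object id, and the size in bytes. A minimal sketch of reading one follows; `parse_lfs_pointer` is a hypothetical helper name, and the pointer text is copied from the lite 2602 entry above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


LITE_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3ea098db9bd145be525c7e2366920b6d76c5ffd46b3d7aa8169bbc943fdaee35
size 2016627488
"""

info = parse_lfs_pointer(LITE_POINTER)
size_gib = info["size_bytes"] / 2**30
print(f"{info['hash_algo']} {info['digest'][:8]}... {size_gib:.2f} GiB")
```

The lite checkpoint comes out at roughly 1.88 GiB, which is consistent with the "1.9GB" figure given for the lite models in the 2026.01.12 update note.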
asset/gray.jpg ADDED

Git LFS Details

  • SHA256: 6bd84884bc99e86aa46618bf182d1dbcb5c6ec41fbd78bd6cbad725e44d5b179
  • Pointer size: 132 Bytes
  • Size of remote file: 1.06 MB
results/gray.png ADDED

Git LFS Details

  • SHA256: 50b3bf7cd171c58f22d020b8b00b84273577e2386366e41f5597f68e2309f58f
  • Pointer size: 132 Bytes
  • Size of remote file: 2.91 MB
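
The SHA256 values listed for the LFS objects can be used to check a downloaded file. The following is a minimal sketch using only the standard library; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with the hex digest shown on the commit page."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For instance, `matches_expected("results/gray.png", "50b3bf7cd171c58f22d020b8b00b84273577e2386366e41f5597f68e2309f58f")` should hold for an intact local copy of the result image.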