Improve model card: Update pipeline_tag, add library_name, and include NABLA paper details

#10
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +33 -21
README.md CHANGED
@@ -1,15 +1,16 @@
  ---
- license: apache-2.0
  language:
  - en
  - zh
  tags:
  - video generation
  - video-to-video editing
  - refernce-to-video
-
- pipeline_tag: image-to-video
  ---
  # Wan2.1

  <p align="center">
@@ -31,6 +32,17 @@ In this repository, we present **Wan2.1**, a comprehensive and open suite of vid
  - 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
  - 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

  ## Video Demos

  <div align="center">
@@ -106,18 +118,18 @@ pip install -r requirements.txt

  #### Model Download

- | Models | Download Link | Notes |
- |--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
- | T2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P
- | I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
- | I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
- | T2V-1.3B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P
- | FLF2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-FLF2V-14B-720P) | Supports 720P
- | VACE-1.3B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-1.3B) | Supports 480P
- | VACE-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-14B) | Supports both 480P and 720P
-
- > 💡Note:
- > * The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.
  > * For the first-last frame to video generation, we train our model primarily on Chinese text-video pairs. Therefore, we recommend using a Chinese prompt to achieve better results.

@@ -190,7 +202,7 @@ python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B

  * Ulysses Strategy

- If you want to use the [`Ulysses`](https://arxiv.org/abs/2309.14509) strategy, you should set `--ulysses_size $GPU_NUMS`. Note that `num_heads` should be divisible by `ulysses_size` if you wish to use the `Ulysses` strategy. For the 1.3B model, `num_heads` is `12`, which can't be divided by 8 (as most multi-GPU machines have 8 GPUs). Therefore, it is recommended to use the `Ring Strategy` instead.

  * Ring Strategy

@@ -611,7 +623,7 @@ We test the computational efficiency of different **Wan2.1** models on different

  ## Introduction of Wan2.1

- **Wan2.1** is designed on the mainstream diffusion transformer paradigm, achieving significant advancements in generative capabilities through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Collectively, these contributions enhance the model's performance and versatility.


  ##### (1) 3D Variational Autoencoders
@@ -632,10 +644,10 @@ We propose a novel 3D causal VAE architecture, termed **Wan-VAE** specifically d
  </div>


- | Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
- |--------|-----------|-----------------|------------------|-----------------------|---------------------|-----------------|------------------|
- | 1.3B | 1536 | 16 | 16 | 8960 | 256 | 12 | 30 |
- | 14B | 5120 | 16 | 16 | 13824 | 256 | 40 | 40 |

 
  ---
  language:
  - en
  - zh
+ license: apache-2.0
+ pipeline_tag: any-to-any
  tags:
  - video generation
  - video-to-video editing
  - refernce-to-video
+ library_name: diffusers
  ---
+
  # Wan2.1

  <p align="center">
 
  - 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
  - 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

+ ## $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
+
+ The **Wan2.1** model incorporates advanced attention mechanisms, including those detailed in the paper [$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention](https://huggingface.co/papers/2507.13546).
+
+ The abstract of the paper is as follows:
+ "Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to baseline almost without compromising quantitative metrics (CLIP score, VBench score, human evaluation score) and visual quality drop."
+
+ This technique contributes to the efficiency and performance of video generation within the Wan2.1 framework.
+
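To make the idea above more concrete, here is a minimal, illustrative sketch of block-level sparse attention in the spirit of NABLA, built on PyTorch's Flex Attention (available in `torch >= 2.5`). This is not the released NABLA implementation; the block size, the mean-pooled coarse attention map, and the top-k keep rule are assumptions made for the example.

```python
# Hedged sketch: NABLA-style block-level sparse attention via PyTorch Flex Attention.
# Not the authors' code; block size, pooling, and the keep-density rule are assumptions.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

BLOCK = 64  # assumed block size along the flattened video-token axis

def nabla_like_block_mask(q, k, keep_density=0.25):
    """Estimate a coarse block-level attention map and keep only the strongest blocks."""
    B, H, L, D = q.shape
    nblk = L // BLOCK
    # Mean-pool tokens into blocks to get a cheap low-resolution attention map.
    qp = q.view(B, H, nblk, BLOCK, D).mean(dim=3)
    kp = k.view(B, H, nblk, BLOCK, D).mean(dim=3)
    scores = torch.softmax(qp @ kp.transpose(-1, -2) / D**0.5, dim=-1)  # (B, H, nblk, nblk)
    # Adaptive threshold: keep the top fraction of key blocks per query block.
    k_keep = max(1, int(keep_density * nblk))
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]
    block_map = scores >= thresh  # boolean block-level sparsity pattern

    def mask_mod(b, h, q_idx, kv_idx):
        # Token pair (q_idx, kv_idx) attends only if its block pair was kept.
        return block_map[b, h, q_idx // BLOCK, kv_idx // BLOCK]

    return create_block_mask(mask_mod, B, H, L, L, device=q.device)

# Usage on random tensors standing in for flattened DiT video tokens.
B, H, L, D = 1, 12, 1024, 128
q, k, v = (torch.randn(B, H, L, D, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, block_mask=nabla_like_block_mask(q, k))  # same shape as q
```
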
  ## Video Demos

  <div align="center">
 

  #### Model Download

+ | Models | Download Link | Notes |
+ |---|---|---|
+ | T2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P
+ | I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
+ | I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
+ | T2V-1.3B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P
+ | FLF2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-FLF2V-14B-720P) | Supports 720P
+ | VACE-1.3B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-1.3B) | Supports 480P
+ | VACE-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-14B) | Supports both 480P and 720P
+
+ > 💡Note:
+ > * The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.
  > * For the first-last frame to video generation, we train our model primarily on Chinese text-video pairs. Therefore, we recommend using a Chinese prompt to achieve better results.

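For convenience, any of the checkpoints in the table above can also be fetched from Python; a minimal sketch using `huggingface_hub.snapshot_download` (an alternative to the `huggingface-cli` route, assuming the `huggingface_hub` package is installed):

```python
# Minimal download sketch via the Hugging Face Hub client library.
# Swap repo_id for the variant you need (see the table above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",   # 480P text-to-video checkpoint
    local_dir="./Wan2.1-T2V-1.3B",      # weights are placed here
)
```
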
 
 

  * Ulysses Strategy

+ If you want to use the [`Ulysses`](https://arxiv.org/abs/2309.14509) strategy, you should set `--ulysses_size $GPU_NUMS`. Note that `num_heads` should be divisible by `ulysses_size` if you wish to use the `Ulysses` strategy. For the 1.3B model, `num_heads` is `12`, which can't be divided by 8 (as most multi-GPU machines have 8 GPUs). Therefore, it is recommended to use the `Ring Strategy` instead.

  * Ring Strategy
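The head-divisibility rule in the Ulysses note above is easy to check up front. A small illustrative helper (not part of the repository; head counts taken from the model dimension table later in this card):

```python
# Hypothetical helper: pick a sequence-parallel strategy from the head count.
NUM_HEADS = {"t2v-1.3B": 12, "t2v-14B": 40}  # from the model dimension table

def pick_strategy(task: str, gpus: int) -> str:
    """Ulysses shards attention heads across GPUs, so num_heads must divide evenly."""
    return "ulysses" if NUM_HEADS[task] % gpus == 0 else "ring"

print(pick_strategy("t2v-1.3B", 8))  # ring    -> 12 is not divisible by 8
print(pick_strategy("t2v-14B", 8))   # ulysses -> 40 / 8 = 5 heads per GPU
```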
 
 

  ## Introduction of Wan2.1

+ **Wan2.1** is designed on the mainstream diffusion transformer paradigm, achieving significant advancements in generative capabilities through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Collectively, these contributions enhance the model's performance and versatility.


  ##### (1) 3D Variational Autoencoders
 
  </div>


+ | Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
+ |---|---|---|---|---|---|---|---|
+ | 1.3B | 1536 | 16 | 16 | 8960 | 256 | 12 | 30 |
+ | 14B | 5120 | 16 | 16 | 13824 | 256 | 40 | 40 |

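Since the card now declares `library_name: diffusers`, a minimal loading sketch may help readers get started. It assumes the Diffusers-format checkpoint (`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`) and the `WanPipeline` / `AutoencoderKLWan` classes shipped in recent `diffusers` releases; resolution, frame count, and guidance values are illustrative, not prescribed by this card.

```python
# Hedged sketch: text-to-video generation through the Diffusers integration.
# Assumes a recent diffusers release exposing WanPipeline / AutoencoderKLWan and
# the Diffusers-format repo Wan-AI/Wan2.1-T2V-1.3B-Diffusers.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=480,
    width=832,
    num_frames=81,      # roughly 5 seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "wan_t2v_sample.mp4", fps=16)
```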