---
license: apache-2.0
---
<div align="center">
<picture>
<img src="assets/KANDINSKY_LOGO_1_BLACK.png">
</picture>
</div>
<div align="center">
<a href="https://habr.com/ru/companies/sberbank/articles/951800/">Habr</a> | <a href="https://ai-forever.github.io/Kandinsky-5/">Project Page</a> | Technical Report (soon) | <a href="https://github.com/ai-forever/Kandinsky-5">Original Github</a> | <a href="https://huggingface.co/collections/ai-forever/kandinsky-50-t2v-lite-diffusers-68dd73ebac816748ed79d6cb"> 🤗 Diffusers</a>
</div>
-----
<h1>Kandinsky 5.0 T2V Lite - Diffusers</h1>
This repository provides the 🤗 Diffusers integration for Kandinsky 5.0 T2V Lite - a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class.
## Project Updates
- 🔥 **2025/09/29**: We have open-sourced `Kandinsky 5.0 T2V Lite`, a lite (2B parameters) version of the `Kandinsky 5.0 Video` text-to-video generation model.
- **Diffusers Integration**: Now available as an easy-to-use 🤗 Diffusers pipeline!
## Kandinsky 5.0 T2V Lite
Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger Wan models (5B and 14B) and offers the best understanding of Russian concepts in the open-source ecosystem.
We provide 8 model variants (four model types, each in 5s and 10s versions), optimized for different use cases:
* **SFT model**: delivers the highest generation quality
* **CFG-distilled**: runs 2× faster
* **Diffusion-distilled**: enables low-latency generation with minimal quality loss (6× faster)
* **Pretrain model**: designed for fine-tuning by researchers and enthusiasts
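The quoted speedups are consistent with a simple count of transformer forward passes. The sketch below is only a back-of-the-envelope check and rests on assumptions not stated above: that the SFT model runs 50 sampling steps with classifier-free guidance (two forward passes per step), that the CFG-distilled model runs the same 50 steps with a single pass each, and that the diffusion-distilled model runs 16 single-pass steps. The function name `forward_passes` is hypothetical.

```python
def forward_passes(num_steps: int, uses_cfg: bool) -> int:
    """Count transformer forward passes for one generation.

    Classifier-free guidance evaluates the model twice per step
    (conditional and unconditional), so it doubles the count.
    """
    return num_steps * (2 if uses_cfg else 1)


# Assumed settings (illustrative, not confirmed by the model card).
baseline = forward_passes(50, uses_cfg=True)               # SFT with CFG
cfg_distilled = forward_passes(50, uses_cfg=False)         # CFG-distilled
diffusion_distilled = forward_passes(16, uses_cfg=False)   # 16-step distilled

speedup_cfg = baseline / cfg_distilled
speedup_diffusion = baseline / diffusion_distilled

print(f"CFG-distilled: {speedup_cfg:.2f}x")
print(f"Diffusion-distilled: {speedup_diffusion:.2f}x")
```

Under these assumptions the ratios come out at 2× and roughly 6×. Measured wall-clock speedups will differ somewhat, since text encoding and VAE decoding are not affected by distillation.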
## Basic Usage
```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Select the flex attention backend and compile the transformer
pipe.transformer.set_attention_backend("flex")
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,
    num_inference_steps=50,
    guidance_scale=1.0,
).frames[0]

# Save the video
export_to_video(output, "output.mp4", fps=24, quality=9)
```
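The example above requests `num_frames=241` and exports at `fps=24`, which works out to a 10-second clip if the first frame is counted as time zero. The helper below is a small sketch of that arithmetic. The "seconds × fps + 1" convention is inferred from those two numbers, and the 5-second value is an extrapolation, so check both against the model you actually load. The function names are hypothetical.

```python
def frames_for_duration(seconds: float, fps: int = 24) -> int:
    """Frame count for a clip of the given length, counting the frame at t=0."""
    return int(round(seconds * fps)) + 1


def duration_of(num_frames: int, fps: int = 24) -> float:
    """Clip length in seconds spanned by num_frames at the given frame rate."""
    return (num_frames - 1) / fps


print(frames_for_duration(10))  # frame count matching the 10s example above
print(frames_for_duration(5))   # extrapolated frame count for a 5s model
print(duration_of(241))         # seconds spanned by 241 frames at 24 fps
```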
## Using Different Model Variants
```python
import torch
from diffusers import Kandinsky5T2VPipeline

# Load only the variant you need: each call below loads a full pipeline.

# 5s SFT model (highest quality)
pipe_sft_5s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 5s Distilled 16-step model (fastest)
pipe_distill_5s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 5s No-CFG model (balanced speed/quality)
pipe_nocfg_5s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 5s Pretrain model (most diverse)
pipe_pretrain_5s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 10s SFT model (highest quality)
pipe_sft_10s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 10s Distilled 16-step model (fastest)
pipe_distill_10s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 10s No-CFG model (balanced speed/quality)
pipe_nocfg_10s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# 10s Pretrain model (most diverse)
pipe_pretrain_10s = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)
```
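The eight repository names listed above follow one pattern, so the checkpoint can be selected from a variant name and a duration. The helper below is a convenience sketch, not part of Diffusers: `repo_id`, `VARIANTS`, and `DURATIONS` are hypothetical names, and the pattern is simply read off the identifiers shown above.

```python
VARIANTS = ("sft", "nocfg", "distilled16steps", "pretrain")
DURATIONS = ("5s", "10s")


def repo_id(variant: str, duration: str) -> str:
    """Build the Hugging Face repository ID for a model variant and clip length."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    if duration not in DURATIONS:
        raise ValueError(f"unknown duration {duration!r}, expected one of {DURATIONS}")
    return f"ai-forever/Kandinsky-5.0-T2V-Lite-{variant}-{duration}-Diffusers"


# Enumerate every combination: four variants times two durations.
all_ids = [repo_id(v, d) for d in DURATIONS for v in VARIANTS]
for name in all_ids:
    print(name)
```

The resulting string can be passed as the first argument to `Kandinsky5T2VPipeline.from_pretrained`, which keeps only one pipeline in memory at a time.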
## Architecture
Latent diffusion pipeline with Flow Matching:
* Qwen2.5-VL and CLIP provide the text embeddings.
* HunyuanVideo 3D VAE encodes and decodes video to and from a latent space.
* A Diffusion Transformer (DiT) is the main generative backbone and uses cross-attention to condition on the text embeddings.
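To give a feel for what working in a latent space means for tensor sizes, the sketch below estimates the latent shape the DiT operates on. The compression factors are assumptions, not values taken from this model card: 4× temporal compression with the first frame kept, 8× spatial compression, and 16 latent channels, which is how the HunyuanVideo VAE is commonly described. Verify them against the VAE config of the checkpoint you load. The function name `latent_shape` is hypothetical.

```python
def latent_shape(
    num_frames: int,
    height: int,
    width: int,
    temporal_factor: int = 4,   # assumed temporal compression
    spatial_factor: int = 8,    # assumed spatial compression
    latent_channels: int = 16,  # assumed number of latent channels
) -> tuple[int, int, int, int]:
    """Return (channels, frames, height, width) of the video latent."""
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames must be a multiple of temporal_factor plus 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (
        latent_channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# Settings from the basic usage example: 241 frames at 512x768.
print(latent_shape(241, 512, 768))
```

Under these assumptions a 241-frame 512×768 clip is represented by 61 latent frames of 64×96, so the transformer processes far fewer positions than there are pixels.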
<div align="center">
<img width="1600" height="477" alt="Pipeline Architecture" src="https://github.com/user-attachments/assets/17fc2eb5-05e3-4591-9ec6-0f6e1ca397b3" />
</div>
<div align="center">
<img width="800" height="406" alt="Model Architecture" src="https://github.com/user-attachments/assets/f3006742-e261-4c39-b7dc-e39330be9a09" />
</div>
## Examples
### Kandinsky 5.0 T2V Lite SFT
<table border="0" style="width: 200; text-align: left; margin-top: 20px;">
  <tr>
    <td> <video src="https://github.com/user-attachments/assets/bc38821b-f9f1-46db-885f-1f70464669eb" width=200 controls autoplay loop></video> </td>
    <td> <video src="https://github.com/user-attachments/assets/9f64c940-4df8-4c51-bd81-a05de8e70fc3" width=200 controls autoplay loop></video> </td>
  </tr>
  <tr>
    <td> <video src="https://github.com/user-attachments/assets/77dd417f-e0bf-42bd-8d80-daffcd054add" width=200 controls autoplay loop></video> </td>
    <td> <video src="https://github.com/user-attachments/assets/385a0076-f01c-4663-aa46-6ce50352b9ed" width=200 controls autoplay loop></video> </td>
  </tr>
  <tr>
    <td> <video src="https://github.com/user-attachments/assets/7c1bcb31-cc7d-4385-9a33-2b0cc28393dd" width=200 controls autoplay loop></video> </td>
    <td> <video src="https://github.com/user-attachments/assets/990a8a0b-2df1-4bbc-b2e3-2859b6f1eea6" width=200 controls autoplay loop></video> </td>
  </tr>
</table>
### Kandinsky 5.0 T2V Lite Distill
<table border="0" style="width: 200; text-align: left; margin-top: 20px;">
  <tr>
    <td> <video src="https://github.com/user-attachments/assets/861342f9-f576-4083-8a3b-94570a970d58" width=200 controls autoplay loop></video> </td>
    <td> <video src="https://github.com/user-attachments/assets/302e4e7d-781d-4a58-9b10-8c473d469c4b" width=200 controls autoplay loop></video> </td>
  </tr>
  <tr>
    <td> <video src="https://github.com/user-attachments/assets/3e70175c-40e5-4aec-b506-38006fe91a76" width=200 controls autoplay loop></video> </td>
    <td> <video src="https://github.com/user-attachments/assets/b7da85f7-8b62-4d46-9460-7f0e505de810" width=200 controls autoplay loop></video> </td>
  </tr>
</table>
## Results
### Side-by-Side Evaluation
The evaluation is based on the expanded prompts from the Movie Gen benchmark.
<table border="0" style="width: 400; text-align: left; margin-top: 20px;">
  <tr>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_sora.jpg" width=400 /> </td>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.1_14B.jpg" width=400 /> </td>
  </tr>
  <tr>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_5B.jpg" width=400 /> </td>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_A14B.jpg" width=400 /> </td>
  </tr>
  <tr>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.1_1.3B.jpg" width=400 /> </td>
  </tr>
</table>
### Distill Side-by-Side Evaluation
<table border="0" style="width: 400; text-align: left; margin-top: 20px;">
  <tr>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_5s_vs_kandinsky_5_video_lite_distill_5s.jpg" width=400 /> </td>
    <td> <img src="assets/sbs/kandinsky_5_video_lite_10s_vs_kandinsky_5_video_lite_distill_10s.jpg" width=400 /> </td>
  </tr>
</table>
### VBench Results
<div align="center"> <picture> <img src="assets/vbench.png"> </picture> </div>
## Beta Testing
You can apply to participate in the beta testing of Kandinsky Video Lite via the Telegram bot.
## Citation
```bibtex
@misc{kandinsky2025,
author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov
and Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim
and Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov
and Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva
and Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov
and Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina
and Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
title = {Kandinsky 5.0: A family of diffusion models for Video \& Image generation},
howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
year = 2025
}
@misc{mikhailov2025nablanablaneighborhoodadaptiveblocklevel,
title={$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention},
author={Dmitrii Mikhailov and Aleksey Letunovskiy and Maria Kovaleva and Vladimir Arkhipkin
and Vladimir Korviakov and Vladimir Polovnikov and Viacheslav Vasilev
and Evelina Sidorova and Denis Dimitrov},
year={2025},
eprint={2507.13546},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13546},
}
```