Upload folder using huggingface_hub

Files changed:
- .gitattributes +11 -0
- README.md +102 -145
- assets/KANDINSKY_LOGO_1_BLACK.png +0 -0
- assets/generation_examples/images/1.jpg +3 -0
- assets/generation_examples/images/2.jpg +3 -0
- assets/generation_examples/images/3.jpg +3 -0
- assets/generation_examples/images/4.jpg +3 -0
- assets/generation_examples/images/5.jpg +3 -0
- assets/generation_examples/images/6.jpg +3 -0
- assets/generation_examples/images/7.jpg +3 -0
- assets/generation_examples/images/8.jpg +3 -0
- assets/generation_examples/images/9.jpg +3 -0
- assets/sbs_edit.png +3 -0
- assets/sbs_image.png +3 -0
.gitattributes CHANGED

@@ -42,3 +42,14 @@ assets/sbs/kandinsky_5_video_lite_vs_wan_2.1_14B.jpg filter=lfs diff=lfs merge=lfs -text
 assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_5B.jpg filter=lfs diff=lfs merge=lfs -text
 assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_A14B.jpg filter=lfs diff=lfs merge=lfs -text
 assets/vbench.png filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/1.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/2.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/3.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/4.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/5.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/6.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/7.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/8.jpg filter=lfs diff=lfs merge=lfs -text
+assets/generation_examples/images/9.jpg filter=lfs diff=lfs merge=lfs -text
+assets/sbs_edit.png filter=lfs diff=lfs merge=lfs -text
+assets/sbs_image.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED

@@ -1,6 +1,7 @@
---
license: apache-2.0
---
+
<div align="center">
<picture>
<img src="assets/KANDINSKY_LOGO_1_BLACK.png">
@@ -8,177 +9,133 @@ license: apache-2.0
</div>

<div align="center">
-<a href="https://habr.com/ru/companies/sberbank/articles/951800/">Habr</a> |
</div>

-----

-<h1>Kandinsky 5.0
-
-🚀 **Diffusers Integration**: Now available with easy-to-use 🤗 Diffusers pipeline!
-
-## Kandinsky 5.0 T2V Lite
-
-* **SFT model** — delivers the highest generation quality
-* **CFG-distilled** — runs 2× faster
-* **Diffusion-distilled** — enables low-latency generation with minimal quality loss (6× faster)
-* **Pretrain model** — designed for fine-tuning by researchers and enthusiasts
-
-## Basic Usage
```python
import torch
-from diffusers import Kandinsky5T2VPipeline
-from diffusers.utils import export_to_video

# Load the pipeline
-pipe = Kandinsky5T2VPipeline.from_pretrained(
-)
-pipe = pipe.to("cuda")

-# Generate
-prompt = "A cat
-negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
-    negative_prompt=negative_prompt,
-    height=
-    width=
-    num_frames=121,
    num_inference_steps=50,
-    guidance_scale=5
-).
-
-# Save the video
-export_to_video(output, "output.mp4", fps=24, quality=9)
```

-##
-
-# 5s Pretrain model (most diverse)
-pipe_pretrain = Kandinsky5T2VPipeline.from_pretrained(
-    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers",
-    torch_dtype=torch.bfloat16
-)
-
-# 10s SFT model (highest quality)
-pipe_sft = Kandinsky5T2VPipeline.from_pretrained(
-    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
-    torch_dtype=torch.bfloat16
-)
-
-# 10s Distilled 16-step model (fastest)
-pipe_distill = Kandinsky5T2VPipeline.from_pretrained(
-    "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers",
-    torch_dtype=torch.bfloat16
-)
-
-# 10s No-CFG model (balanced speed/quality)
-pipe_nocfg = Kandinsky5T2VPipeline.from_pretrained(
-    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers",
-    torch_dtype=torch.bfloat16
-)
-
-# 10s Pretrain model (most diverse)
-pipe_pretrain = Kandinsky5T2VPipeline.from_pretrained(
-    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers",
-    torch_dtype=torch.bfloat16
-)
-```
-
-## Architecture
-Latent diffusion pipeline with Flow Matching.
-
-Diffusion Transformer (DiT) as the main generative backbone with cross-attention to text embeddings.
-
-Qwen2.5-VL and CLIP provide text embeddings.
-
-HunyuanVideo 3D VAE encodes/decodes video into a latent space.
-
-DiT is the main generative module using cross-attention to condition on text.
-
-<div align="center">
-<img width="1600" height="477" alt="Pipeline Architecture" src="https://github.com/user-attachments/assets/17fc2eb5-05e3-4591-9ec6-0f6e1ca397b3" />
-</div>
-
-<div align="center">
-<img width="800" height="406" alt="Model Architecture" src="https://github.com/user-attachments/assets/f3006742-e261-4c39-b7dc-e39330be9a09" />
-</div>
-
-## Examples
-
-Kandinsky 5.0 T2V Lite SFT
-<table border="0" style="width: 200; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/bc38821b-f9f1-46db-885f-1f70464669eb" width=200 controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9f64c940-4df8-4c51-bd81-a05de8e70fc3" width=200 controls autoplay loop></video> </td> <tr> <td> <video src="https://github.com/user-attachments/assets/77dd417f-e0bf-42bd-8d80-daffcd054add" width=200 controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/385a0076-f01c-4663-aa46-6ce50352b9ed" width=200 controls autoplay loop></video> </td> <tr> <td> <video src="https://github.com/user-attachments/assets/7c1bcb31-cc7d-4385-9a33-2b0cc28393dd" width=200 controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/990a8a0b-2df1-4bbc-b2e3-2859b6f1eea6" width=200 controls autoplay loop></video> </td> </tr> </table>
-Kandinsky 5.0 T2V Lite Distill
-<table border="0" style="width: 200; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/861342f9-f576-4083-8a3b-94570a970d58" width=200 controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/302e4e7d-781d-4a58-9b10-8c473d469c4b" width=200 controls autoplay loop></video> </td> <tr> <td> <video src="https://github.com/user-attachments/assets/3e70175c-40e5-4aec-b506-38006fe91a76" width=200 controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/b7da85f7-8b62-4d46-9460-7f0e505de810" width=200 controls autoplay loop></video> </td> </table>
-Results
-Side-by-Side Evaluation
-The evaluation is based on the expanded prompts from the Movie Gen benchmark.
-
-<table border="0" style="width: 400; text-align: left; margin-top: 20px;"> <tr> <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_sora.jpg" width=400 ></img> </td> <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.1_14B.jpg" width=400 ></img> </td> <tr> <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_5B.jpg" width=400 ></img> </td> <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.2_A14B.jpg" width=400 ></img> </td> <tr> <td> <img src="assets/sbs/kandinsky_5_video_lite_vs_wan_2.1_1.3B.jpg" width=400 ></img> </td> </table>
-Distill Side-by-Side Evaluation
-<table border="0" style="width: 400; text-align: left; margin-top: 20px;"> <tr> <td> <img src="assets/sbs/kandinsky_5_video_lite_5s_vs_kandinsky_5_video_lite_distill_5s.jpg" width=400 ></img> </td> <td> <img src="assets/sbs/kandinsky_5_video_lite_10s_vs_kandinsky_5_video_lite_distill_10s.jpg" width=400 ></img> </td> </table>
-VBench Results
-<div align="center"> <picture> <img src="assets/vbench.png"> </picture> </div>
-Beta Testing
-You can apply to participate in the beta testing of Kandinsky Video Lite via the Telegram bot.
-
```bibtex
@misc{kandinsky2025,
-    author = {Alexey Letunovskiy
-              Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim,
-              Anastasiia Kargapoltseva, Nikita Kiselev, Vladimir Arkhipkin, Vladimir Korviakov,
-              Nikolai Gerasimenko, Denis Parkhomenko, Anna Dmitrienko, Anastasia Maltseva,
-              Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov,
-              Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina,
-              Tatiana Nikulina, Polina Gavrilova, Denis Dimitrov},
    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
-    howpublished = {\url{https://github.com/
    year = 2025
}
-
-@misc{mikhailov2025nablanablaneighborhoodadaptiveblocklevel,
-    title={$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention},
-    author={Dmitrii Mikhailov and Aleksey Letunovskiy and Maria Kovaleva and Vladimir Arkhipkin
-            and Vladimir Korviakov and Vladimir Polovnikov and Viacheslav Vasilev
-            and Evelina Sidorova and Denis Dimitrov},
-    year={2025},
-    eprint={2507.13546},
-    archivePrefix={arXiv},
-    primaryClass={cs.CV},
-    url={https://arxiv.org/abs/2507.13546},
-}
-```
---
license: apache-2.0
---
+
<div align="center">
<picture>
<img src="assets/KANDINSKY_LOGO_1_BLACK.png">
</picture>
</div>

<div align="center">
+<a href="https://habr.com/ru/companies/sberbank/articles/951800/">Habr</a> |
+<a href="https://ai-forever.github.io/Kandinsky-5/">Project Page</a> |
+<a href="https://github.com/kandinskylab/kandinsky-5/blob/main/paper.pdf">Technical Report</a> |
+<a href="https://github.com/ai-forever/Kandinsky-5">Original GitHub</a> |
+<a href="https://huggingface.co/collections/kandinskylab/kandinsky-50-image-lite-diffusers">🤗 Diffusers</a>
</div>

-----

+<h1>Kandinsky 5.0 T2I Lite SFT – Diffusers</h1>
+
+Kandinsky 5.0 is a family of diffusion models for video and image generation.

+Kandinsky 5.0 Image Lite is a lightweight text-to-image (T2I) generation model with 6B parameters.

+The model introduces several key innovations:
+- **Latent diffusion pipeline** with **Flow Matching** for improved training stability
+- **Diffusion Transformer (DiT)** as the main generative backbone with cross-attention to text embeddings
+- Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding
+- **Flux VAE** for efficient image encoding and decoding

+The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).

+## Available Models

+Kandinsky 5.0 Image Lite:
+| model_id | Description | Use Cases |
+|------------|-------------|-----------|
+| **<a href="https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers">kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers</a>** | 6B supervised fine-tuned text-to-image model | Highest generation quality |
+| **<a href="https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers">kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers</a>** | 6B supervised fine-tuned image-to-image editing model | Highest generation quality |
+| **<a href="https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers">kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers</a>** | 6B base pretrained text-to-image model | Research and fine-tuning |
+| **<a href="https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers">kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers</a>** | 6B base pretrained image-to-image editing model | Research and fine-tuning |

+## Examples
+
+<table border="0" style="width: 90%; text-align: left; margin-top: 20px;">
+<tr>
+<td>
+<img src="assets/generation_examples/images/1.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/2.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/9.jpg" width=90% >
+</td>
+<tr>
+</table>
+<table border="0" style="width: 90%; text-align: left; margin-top: 10px;">
+<td>
+<img src="assets/generation_examples/images/4.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/5.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/3.jpg" width=90% >
+</td>
+
+</table>
+<table border="0" style="width: 90%; text-align: left; margin-top: 10px;">
+<td>
+<img src="assets/generation_examples/images/7.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/8.jpg" width=90% >
+</td>
+<td>
+<img src="assets/generation_examples/images/6.jpg" width=90% >
+</td>
+
+</table>
+
+## Kandinsky5T2IPipeline Usage Example

```python
import torch
+from diffusers import Kandinsky5T2IPipeline

# Load the pipeline
+model_id = "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers"
+pipe = Kandinsky5T2IPipeline.from_pretrained(model_id)
+_ = pipe.to(device="cuda", dtype=torch.bfloat16)

+# Generate image
+prompt = "A fluffy, expressive cat wearing a bright red hat with a soft, slightly textured fabric. The hat should look cozy and well-fitted on the cat’s head. On the front of the hat, add clean, bold white text that reads “SWEET”, clearly visible and neatly centered. Ensure the overall lighting highlights the hat’s color and the cat’s fur details."

output = pipe(
    prompt=prompt,
+    negative_prompt="",
+    height=1024,
+    width=1024,
    num_inference_steps=50,
+    guidance_scale=3.5,
+).image[0]
```

+## Results
+
+<table style="width:100%; text-align:center; margin-top:20px;">
+<tr>
+<td>
+<img src="assets/sbs_image.png" width="100%">
+</td>
+<td>
+<img src="assets/sbs_edit.png" width="100%">
+</td>
+</tr>
+<tr>
+<td style="font-size: 1.1em; font-weight: 500; padding-top: 6px;">
+Side-by-side evaluation of T2I on PartiPrompts with extended prompts
+</td>
+<td style="font-size: 1.1em; font-weight: 500; padding-top: 6px;">
+Side-by-side evaluation of I2I on the Flux Kontext benchmark with extended prompts
+</td>
+</tr>
+</table>
+
+
+## Citation

```bibtex
@misc{kandinsky2025,
+    author = {Alexander Belykh and Alexander Varlamov and Alexey Letunovskiy and Anastasia Aliaskina and Anastasia Maltseva and Anastasiia Kargapoltseva and Andrey Shutkin and Anna Averchenkova and Anna Dmitrienko and Bulat Akhmatov and Denis Dimitrov and Denis Koposov and Denis Parkhomenko and Dmitrii and Ilya Vasiliev and Ivan Kirillov and Julia Agafonova and Kirill Chernyshev and Kormilitsyn Semen and Lev Novitskiy and Maria Kovaleva and Mikhail Mamaev and Mikhailov and Nikita Kiselev and Nikita Osterov and Nikolai Gerasimenko and Nikolai Vaulin and Olga Kim and Olga Vdovchenko and Polina Gavrilova and Polina Mikhailova and Tatiana Nikulina and Viacheslav Vasilev and Vladimir Arkhipkin and Vladimir Korviakov and Vladimir Polovnikov and Yury Kolabushin},
    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
+    howpublished = {\url{https://github.com/kandinskylab/Kandinsky-5}},
    year = 2025
}
+```
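The new model card describes the architecture only in bullet form (Flow Matching in a latent space, a DiT conditioned on Qwen2.5-VL and CLIP text embeddings, and a Flux VAE for encoding and decoding). As a rough orientation, the sketch below shows how a flow-matching latent pipeline of this general shape typically samples an image. It is an illustrative assumption, not the Kandinsky 5.0 implementation, and every name in it (`dit`, `qwen_embeds`, `clip_embeds`, `vae`) is hypothetical.

```python
import torch

def flow_matching_sample(dit, qwen_embeds, clip_embeds, vae, latent_shape, num_steps=50):
    """Illustrative Euler-style flow-matching sampler (hypothetical, not Kandinsky 5.0 code)."""
    x = torch.randn(latent_shape)            # start from Gaussian noise in the VAE latent space
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt)
        # The DiT predicts a velocity that moves the latent toward the data
        # distribution, conditioned on the text embeddings via cross-attention.
        v = dit(x, t, text_embeds=qwen_embeds, pooled_embeds=clip_embeds)
        x = x + v * dt                        # one Euler integration step
    return vae.decode(x)                      # decode the final latent to pixels
```

In the actual pipeline, classifier-free guidance (the `guidance_scale` argument in the usage example) would additionally combine conditional and unconditional velocity predictions at each step.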
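The usage example above loads only the SFT text-to-image checkpoint. Assuming the base pretrained T2I checkpoint listed in the Available Models table loads through the same `Kandinsky5T2IPipeline` class (an assumption, not verified against the released weights), switching checkpoints is a one-line change:

```python
import torch
from diffusers import Kandinsky5T2IPipeline

# Assumption: the pretrained (non-SFT) T2I repo from the Available Models table
# uses the same pipeline class as the SFT checkpoint shown in the model card.
model_id = "kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers"
pipe = Kandinsky5T2IPipeline.from_pretrained(model_id)
_ = pipe.to(device="cuda", dtype=torch.bfloat16)
```

The two I2I editing checkpoints in the table are image-to-image models and presumably need a dedicated editing pipeline rather than `Kandinsky5T2IPipeline`; check their own model cards before loading them this way.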
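The generation example ends at `).image[0]` without persisting the result. Assuming the pipeline returns PIL images, as most Diffusers pipelines do (an assumption for the snippet below), the output can be saved directly:

```python
# Assumption: `output` is the PIL image produced by the pipeline call above.
output.save("output.png")
```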
assets/KANDINSKY_LOGO_1_BLACK.png ADDED
assets/generation_examples/images/1.jpg ADDED (Git LFS)
assets/generation_examples/images/2.jpg ADDED (Git LFS)
assets/generation_examples/images/3.jpg ADDED (Git LFS)
assets/generation_examples/images/4.jpg ADDED (Git LFS)
assets/generation_examples/images/5.jpg ADDED (Git LFS)
assets/generation_examples/images/6.jpg ADDED (Git LFS)
assets/generation_examples/images/7.jpg ADDED (Git LFS)
assets/generation_examples/images/8.jpg ADDED (Git LFS)
assets/generation_examples/images/9.jpg ADDED (Git LFS)
assets/sbs_edit.png ADDED (Git LFS)
assets/sbs_image.png ADDED (Git LFS)