Text-to-Image
Diffusers
Safetensors
recoilme committed on
Commit
77f9f15
1 Parent(s): 3d9b043
README.md CHANGED
@@ -12,15 +12,30 @@ datasets:
12
 
13
  At AiArtLab, we strive to create a free, compact and fast model that can be trained on consumer graphics cards.
14
 
15
- - Unet: 1.5b parameters
16
- - Qwen3.5: 1.8b parameters
17
- - VAE: 32ch8x16x
18
- - Speed: Sampling: 100%|██████████| 40/40 [00:01<00:00, 29.98it/s]
19
-
20
- ### Random samples
21
  ![promo](media/result_grid.jpg)
22
 
23
- ### Example
24
 
25
  ```
26
  import torch
@@ -46,9 +61,41 @@ image = pipe(
46
 image.show()
47
  ```
48
 
49
  ### VAE
50
 
51
- The VAE in Simple Diffusion utilizes an asymmetric architecture: an 8x encoder paired with a 16x decoder. A compression factor of 8 is maintained during training, while at inference an additional upscaling block effectively doubles the output resolution. This reduces training costs by an order of magnitude and boosts inference speed with no perceptual quality loss; in effect, it acts as an integrated latent upscaler. To ensure a fair comparison with other VAEs, we downsampled the generated images to the input resolution before computing metrics. The SDXS VAE was not trained from scratch; it was initialized from the FLUX.2 VAE weights, then redesigned and retrained.
52
 
53
  [eval.py](src/eval.py)
54
  ```
@@ -61,11 +108,16 @@ FLUX.2 | MSE=2.425e-04 PSNR=38.33 LPIPS=0.023 Edge=0.065 KL=2.160
61
  Wan2.2-TI2V-5B (2Gb) | MSE=7.034e-04 PSNR=34.65 LPIPS=0.050 Edge=0.115 KL=9.429
62
  sdxs-1b (200Mb) | MSE=2.655e-04 PSNR=37.83 LPIPS=0.026 Edge=0.066 KL=2.170
63
  ```
64
  ### Unet
65
 
66
  The UNet architecture in Simple Diffusion is a direct descendant and conceptual continuation of the ideas introduced in the first version of Stable Diffusion. Key distinctions include a relatively small, yet sufficient, number of transformer blocks that ensure an even distribution of attention. Additionally, the number of channels in the final layer has been significantly increased to improve detail rendering. Overall, however, it remains a UNet, similar to SD 1.5.
67
 
68
- Throughout the experiments, we tested hundreds of different configurations and trained dozens of models. Notably, we initially started from the SDXL architecture, assuming it would be a stronger baseline, but ultimately abandoned all of its proposed innovations: uneven attention distribution with increased transformer block depth in the lower layers, a reduced number of blocks in the channel pyramid, micro-conditioning, the dual text encoder, text-time conditioning, and so on. In our experiments, all of these changes increased training time and cost while having a near-zero or negative impact on the final result. In total, investigating architectures and searching for the most efficient configuration took over a year.
69
 
70
 Unfortunately, we were unable to secure grants for model training, apart from a Google TPU grant that we could not utilize due to insufficient preparation and time constraints. As a result, training and experiments were financed primarily from our own funds and user donations. This left a significant mark on the model’s architecture.
71
  We aimed to make it as small and cost-effective to train as possible while maintaining our quality generation requirements. So perhaps the limited budget even worked to our advantage.
@@ -87,14 +139,18 @@ Additionally, the use of a full-fledged language model allowed us to integrate a
87
  This adventure started in December 2024 after the release of the SANA model. We received a donation from Stan for fine-tuning SANA and, together with Stas, began fine-tuning and further developing it. Despite spending the entire budget, we did not achieve significant improvements. However, we were shocked by how poorly the model was trained and designed, and we became convinced that we could do better—though we were wrong.
88
  Shifting Gears
89
  By February 2025, we split our efforts and began designing our own architectures—which we are still doing today. Stas favored the DiT architecture, while I believed in UNet. Despite some differences in architectural views, we maintained close communication, shared our work, and supported each other throughout the process. We also engaged with the AIArtLab community (a virtual Telegram chat for those contributing to model development)—thank you all for your support.
90
- ## Lessons Learned
91
- One of my key mistakes was relying too heavily on LLMs and research papers. Research often presents minor improvements as groundbreaking innovations, and LLMs, trained on such content, can draw incorrect conclusions due to the abundance of clickbait. From autumn 2025, I radically changed my strategy, switching to training simpler models (VAEs), where simple fine-tuning yielded more substantial improvements than expensive research projects—including fine-tuning a VAE to a quality level comparable to Flux-1 at the time.
92
- This shift led me to adopt a zero-trust policy toward any external information not personally verified. As a result, I focused on building a strong local benchmark for rapid, cost-effective experiments. This led me to train models on the "Butterflies" dataset—a set of 1,000 images of butterflies—where a model could be trained from scratch in just an hour to assess the impact of a hypothesis or improvement.
 
 
93
  ## The Evolutionary Path
94
  The second turning point was the transition to a continuous evolutionary improvement strategy. Unfortunately, the Butterflies dataset does not allow for evaluating prompt-following or anatomical generation capabilities. As a result, the model evolved incrementally rather than through revolutionary changes. The same model, from December 2025, underwent around 10 changes, including radical architectural shifts—while always preserving the pre-trained weights. It’s remarkable how well and quickly pre-trained models adapt to changes in architecture and external factors, even radical ones (e.g., switching VAE models, text encoders, or their combinations).
95
  In addition to saving on training costs, this approach helped maintain minimal model size—for example, adding extra transformer blocks followed by an assessment of necessity and rolling back if the changes had no significant impact.
 
 
96
  ## The Role of Hyperparameters
97
- One of the initial mistakes was an excessive focus on hyperparameters during training. Ironically, about 80% of training speed and quality depends on the model architecture (UNet) and the quality of the embeddings (VAE), while the other 20% is influenced by the text encoder’s embeddings. What little remains is determined by the Adam optimizer, and the irony is that Adam is surprisingly forgiving of hyperparameter errors, so I won’t even list them.
98
  ## Tools and Optimization
99
  The model comes with two scripts:
100
 
@@ -102,8 +158,7 @@ A dataset script to convert a folder of image-text pairs into latent representat
102
  A training script provided as a single monolithic file.
103
  Additionally, there’s a script that can be pasted directly into the terminal to automatically train the model with optimized parameters.
104
  ## Training Optimization
105
- All pre-training was done using the AdamW8bit optimizer, which significantly reduced training costs. The final fine-tuning was performed using a more complex optimizer, [Muon + AdamW8bit](https://github.com/recoilme/muon_adamw8bit).
106
-
107
 
108
  ### Train:
109
 
 
12
 
13
  At AiArtLab, we strive to create a free, compact and fast model that can be trained on consumer graphics cards.
14
 
15
+ - Unet: 1.6b parameters
16
+ - Qwen3.5: 1.8b parameters
17
+ - VAE: 32ch8x16x
18
+ - Speed: Sampling: 100%|██████████| 40/40 [00:01<00:00, 29.98it/s]
19
+ - Resolution: from 768px to 1404px, in 64px steps
20
+ - Limitations: trained on a small dataset (~1-2M images), focused on illustrations
21
+
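The resolution bullet above can be made concrete. A small sketch (`valid_sizes` is a hypothetical helper, not part of this repository) enumerating the 64px grid shows that the largest valid size within the stated range is 1344px, which matches the 1088×1344 resolution used in the examples below:

```
# Enumerate the 64px resolution grid implied by the bullet above
# (hypothetical helper, not part of the repository).
def valid_sizes(lo=768, hi=1404, step=64):
    return list(range(lo, hi + 1, step))

sizes = valid_sizes()
print(sizes[0], sizes[-1])  # 768 1344 (1404 itself is not a multiple of 64)
```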
22
+ ### Train in progress
23
+
24
+ Key points
25
+
26
+ - Dec 24: Started research on Linear Transformers.
27
+ - Feb 25: Started research on UNet-based diffusion models.
28
+ - Aug 25: Started research on different VAEs.
29
+ - Sep 25: Created a simple VAE and a [vae collection](https://huggingface.co/AiArtLab/collections).
30
+ - Dec 25: Trained SDXS-1B (0.8B at that point), featuring an SD1.5-like UNet, Long CLIP, a 16-channel simple VAE, and a flow matching target.
31
+ - Jan 26: Implemented a dual text encoder (SDXL-like style). Total rework.
32
+ - Feb 26: Reverted to the classic architecture; tested all the SDXL innovations and went back to simple diffusion. Total rework.
33
+ - Mar 26: Created a 32ch 8x/16x asymmetric VAE and switched to Qwen3.5 2B as the text encoder.
34
+
35
+ ### Samples with seed 0
36
  ![promo](media/result_grid.jpg)
37
 
38
+ ### Text-to-image
39
 
40
  ```
41
  import torch
 
61
 image.show()
62
  ```
63
 
64
+ ### Image upscale
65
+ ```
66
+ upscaled = pipe.image_upscale("media/girl.jpg")
67
+ upscaled[0].show()
68
+ ```
69
+
70
+ ### Prompt refine
71
+ ```
72
+ refined = pipe.refine_prompts("girl")
73
+
74
+ print(refined)
75
+ ```
76
+
77
+ ### Encode image (experimental)
78
+ ```
79
+ emb, mask = pipe.encode_image("media/girl.jpg")
80
+
81
+ # Inspect the result
82
+ print("Pooled vector shape:", emb[:, 0, :].shape)
83
+ image = pipe(
84
+ prompt_embeds = emb,
85
+ prompt_attention_mask = mask,
86
+ negative_prompt = negative_prompt,
87
+ guidance_scale = 4,
88
+ width = 1088,
89
+ height = 1344,
90
+ seed = 0,
91
+ batch_size = 1,
92
+ )[0]
93
+ image[0].show()
94
+ ```
95
+
96
  ### VAE
97
 
98
+ The VAE in Simple Diffusion utilizes an asymmetric architecture: an 8x encoder paired with a 16x decoder. A compression factor of 8 is maintained during training, while at inference an additional upscaling block effectively doubles the output resolution. This reduces training costs by an order of magnitude and boosts inference speed with no perceptual quality loss; in effect, it acts as an integrated latent upscaler. To ensure a fair comparison with other VAEs, we downsampled the generated images to the input resolution before computing metrics. The SDXS VAE was not trained from scratch; it was initialized from the FLUX.2 VAE weights, then redesigned and retrained. We also trained a [16-channel VAE](https://huggingface.co/AiArtLab/simplevae) with FLUX.1-level quality, based on the Aura VAE.
99
 
100
  [eval.py](src/eval.py)
101
  ```
 
108
  Wan2.2-TI2V-5B (2Gb) | MSE=7.034e-04 PSNR=34.65 LPIPS=0.050 Edge=0.115 KL=9.429
109
  sdxs-1b (200Mb) | MSE=2.655e-04 PSNR=37.83 LPIPS=0.026 Edge=0.066 KL=2.170
110
  ```
111
+
112
+ ### Image upscale
113
+
114
+ One interesting feature of the asymmetric VAE is that it can be used as a standalone image and video upscaler. The VAE was trained at resolutions of 512–768 pixels and is effective within that range. Note that it is a latent upscaler, which makes it simple and fast. It is also a "blind" upscaler: unlike model-based upscalers, it interferes with the image minimally and does not alter its essence. This may be useful if you dislike upscalers that change the image's style or even objects in it (say, the model of a phone), inventing something new based on the original. On the other hand, you might dislike it for exactly the same reason: it changes the original only minimally.
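A back-of-the-envelope sketch of why the asymmetry yields an upscale (`upscaled_size` is a hypothetical helper, not code from this repo): encoding divides the resolution by 8 while decoding multiplies it by 16, so every image comes back at twice its input size. This is, presumably, what the `pipe.image_upscale(...)` call shown above relies on.

```
# Hypothetical helper illustrating the 8x-encode / 16x-decode arithmetic;
# the real work is done by the asymmetric VAE, not by this function.
def upscaled_size(h, w, enc_factor=8, dec_factor=16):
    lh, lw = h // enc_factor, w // enc_factor   # encoder: pixels -> latents
    return lh * dec_factor, lw * dec_factor     # decoder: latents -> pixels

print(upscaled_size(768, 512))  # (1536, 1024): a free 2x upscale
```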
115
+
116
  ### Unet
117
 
118
  The UNet architecture in Simple Diffusion is a direct descendant and conceptual continuation of the ideas introduced in the first version of Stable Diffusion. Key distinctions include a relatively small, yet sufficient, number of transformer blocks that ensure an even distribution of attention. Additionally, the number of channels in the final layer has been significantly increased to improve detail rendering. Overall, however, it remains a UNet, similar to SD 1.5.
119
 
120
+ Throughout the experiments, we tested [hundreds](https://wandb.ai/recoilme) of different configurations and trained dozens of [models](https://huggingface.co/AiArtLab/sdxs). Notably, we initially started from the SDXL architecture, assuming it would be a stronger baseline, but ultimately abandoned all of its proposed innovations: uneven attention distribution with increased transformer block depth in the lower layers, a reduced number of blocks in the channel pyramid, micro-conditioning, the dual text encoder, text-time conditioning, and so on. In our experiments, all of these changes increased training time and cost while having a near-zero or negative impact on the final result. In total, investigating architectures and searching for the most efficient configuration took over a year.
121
 
122
 Unfortunately, we were unable to secure grants for model training, apart from a Google TPU grant that we could not utilize due to insufficient preparation and time constraints. As a result, training and experiments were financed primarily from our own funds and user donations. This left a significant mark on the model’s architecture.
123
  We aimed to make it as small and cost-effective to train as possible while maintaining our quality generation requirements. So perhaps the limited budget even worked to our advantage.
 
139
  This adventure started in December 2024 after the release of the SANA model. We received a donation from Stan for fine-tuning SANA and, together with Stas, began fine-tuning and further developing it. Despite spending the entire budget, we did not achieve significant improvements. However, we were shocked by how poorly the model was trained and designed, and we became convinced that we could do better—though we were wrong.
140
  Shifting Gears
141
  By February 2025, we split our efforts and began designing our own architectures—which we are still doing today. Stas favored the DiT architecture, while I believed in UNet. Despite some differences in architectural views, we maintained close communication, shared our work, and supported each other throughout the process. We also engaged with the AIArtLab community (a virtual Telegram chat for those contributing to model development)—thank you all for your support.
142
+ ## The Main Mistake
143
+ One of my key mistakes was relying too heavily on LLMs and research papers. Research often presents minor improvements as groundbreaking innovations, and LLMs, trained on such content, can draw incorrect conclusions. From autumn 2025, I radically changed my strategy, switching to training simpler models (VAEs), where simple fine-tuning yielded more substantial improvements than expensive research projects—including fine-tuning a VAE to a quality level comparable to Flux-1 at the time.
144
+ This shift led me to adopt a zero-trust policy toward any external information I had not personally verified. This does not mean you should not read papers, but I urge you not to trust the conclusions presented in them. It is an extremely radical approach, and I radicalized it intentionally, but it allowed me to move from reading papers and implementing other people's ideas to generating my own and training models.
145
+
146
+ As a result, I focused on building a strong local benchmark for rapid, cost-effective experiments on a single RTX 4080. This led me to train models on the "Butterflies" dataset, a set of 1,000 butterfly images, on which a model can be trained from scratch in about an hour to assess the impact of a hypothesis or improvement ([example](https://www.comet.com/recoilme/unet/356142c52c314078914d0c0db409e1f3?experiment-tab=images&viewId=new)).
147
  ## The Evolutionary Path
148
  The second turning point was the transition to a continuous evolutionary improvement strategy. Unfortunately, the Butterflies dataset does not allow for evaluating prompt-following or anatomical generation capabilities. As a result, the model evolved incrementally rather than through revolutionary changes. The same model, from December 2025, underwent around 10 changes, including radical architectural shifts—while always preserving the pre-trained weights. It’s remarkable how well and quickly pre-trained models adapt to changes in architecture and external factors, even radical ones (e.g., switching VAE models, text encoders, or their combinations).
149
  In addition to saving on training costs, this approach helped maintain minimal model size—for example, adding extra transformer blocks followed by an assessment of necessity and rolling back if the changes had no significant impact.
150
+ ## tldr;
151
+ Stop reading, start training
152
  ## The Role of Hyperparameters
153
+ One of the initial mistakes was an excessive focus on hyperparameters during training. Ironically, about 80% of training speed and quality depends on the model architecture (UNet) and the quality of the embeddings (VAE), while the other 20% is influenced by the text encoder’s embeddings. What little remains is the actual role of hyperparameters. The irony is that Adam (AdamW8bit) is surprisingly forgiving of hyperparameter errors, so I won’t even list them: the defaults are fine.
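Why Adam is so forgiving is easiest to see from the update rule itself. Below is one AdamW step in plain Python (an illustration only; the project trains with the bitsandbytes AdamW8bit variant, and `adamw_step` is not from this repo). The normalized step `m_hat / sqrt(v_hat)` is close to ±1 regardless of the raw gradient scale, so the effective step size is bounded by the learning rate:

```
import math

# One AdamW update in plain Python (illustration; the project itself
# trains with the bitsandbytes AdamW8bit optimizer, not this code).
def adamw_step(p, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    m = b1 * m + (1 - b1) * g                  # first-moment EMA
    v = b2 * v + (1 - b2) * g * g              # second-moment EMA
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)  # decoupled decay
    return p, m, v

p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(round(p, 6))  # 0.999899: the step is ~lr, whatever the gradient scale
```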
154
  ## Tools and Optimization
155
  The model comes with two scripts:
156
 
 
158
  A training script provided as a single monolithic file.
159
  Additionally, there’s a script that can be pasted directly into the terminal to automatically train the model with optimized parameters.
160
  ## Training Optimization
161
+ All pre-training was done using the AdamW8bit optimizer, which significantly reduced training costs.
 
162
 
163
  ### Train:
164
 
girl.jpg CHANGED

Git LFS Details

  • SHA256: 1c805d884786deb953a5473e672f5ab8c9ccf616dcf2811011885d7c7ef767ba
  • Pointer size: 131 Bytes
  • Size of remote file: 133 kB

Git LFS Details

  • SHA256: 2def6f65476e848fc6076e7715421d9cf308fde93f998a478b1d169788548916
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
media/girl.jpg CHANGED

Git LFS Details

  • SHA256: 9d9c7aac3206c22e5e40c29fa5f1ed2203af161921e302ae538f6cc9a20437f3
  • Pointer size: 130 Bytes
  • Size of remote file: 49.6 kB

Git LFS Details

  • SHA256: 2def6f65476e848fc6076e7715421d9cf308fde93f998a478b1d169788548916
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
media/result_grid.jpg CHANGED

Git LFS Details

  • SHA256: 83795de2023af3ef0b99472dbcb9805c7c138573d475dae44506ceebcda808a9
  • Pointer size: 132 Bytes
  • Size of remote file: 7.34 MB

Git LFS Details

  • SHA256: a9c9cdd8c1fcb06b9abf6fad5b80043e216194a1f89187d1d42b40ad078cdd2c
  • Pointer size: 131 Bytes
  • Size of remote file: 455 kB
model_index.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b9fe0891c1d3f4f0b2a8cbca077be3533f28306768c3ea8d5256924fc677a4b1
3
- size 438
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d717995bd1e270fd4694a62255b159df8ec189022ac70567e1d888ac8959161b
3
+ size 503
pipeline_sdxs.py CHANGED
@@ -7,7 +7,6 @@ from dataclasses import dataclass
7
  from diffusers import DiffusionPipeline
8
  from diffusers.utils import BaseOutput
9
  from tqdm import tqdm
10
- from transformers import Qwen3_5ForConditionalGeneration, Qwen3_5Tokenizer
11
 
12
  @dataclass
13
  class SdxsPipelineOutput(BaseOutput):
@@ -15,11 +14,14 @@ class SdxsPipelineOutput(BaseOutput):
15
  prompt: Optional[Union[str, List[str]]] = None
16
 
17
  class SdxsPipeline(DiffusionPipeline):
18
- def __init__(self, vae, text_encoder, tokenizer, unet, scheduler):
 
 
19
  super().__init__()
20
  self.register_modules(
21
  vae=vae,
22
  text_encoder=text_encoder,
 
23
  tokenizer=tokenizer,
24
  unet=unet,
25
  scheduler=scheduler
@@ -30,109 +32,252 @@ class SdxsPipeline(DiffusionPipeline):
30
  if mean is not None and std is not None:
31
  self.vae_latents_std = torch.tensor(std, device=self.unet.device, dtype=self.unet.dtype).view(1, len(std), 1, 1)
32
  self.vae_latents_mean = torch.tensor(mean, device=self.unet.device, dtype=self.unet.dtype).view(1, len(mean), 1, 1)
33
 
34
- def preprocess_image(self, image: Image.Image, width: int, height: int):
35
- """Ресайз и центрированный кроп изображения для асимметричного VAE."""
36
- # For the encoder with scale factor 8
37
- target_height = ((height // self.vae_scale_factor) * self.vae_scale_factor)
38
- target_width = ((width // self.vae_scale_factor) * self.vae_scale_factor)
39
 
40
- w, h = image.size
41
- aspect_ratio = target_width / target_height
42
 
43
- if w / h > aspect_ratio:
44
- new_w = int(h * aspect_ratio)
45
- left = (w - new_w) // 2
46
- image = image.crop((left, 0, left + new_w, h))
47
- else:
48
- new_h = int(w / aspect_ratio)
49
- top = (h - new_h) // 2
50
- image = image.crop((0, top, w, top + new_h))
51
 
52
- image = image.resize((target_width, target_height), resample=Image.LANCZOS)
53
- image = np.array(image).astype(np.float32) / 255.0
54
- image = image[None].transpose(0, 3, 1, 2) # [1, C, H, W]
55
- image = torch.from_numpy(image)
56
- return 2.0 * image - 1.0 # [-1, 1]
57
 
58
 
59
- def encode_prompt(self, prompt, negative_prompt, device, dtype):
60
- def get_encode(texts):
61
- if texts is None:
62
- texts = ""
63
 
64
- if isinstance(texts, str):
65
- texts = [texts]
66
 
67
- with torch.no_grad():
68
- # 1. Build the text prompts, wrapping them in the chat template
69
- formatted_prompts = []
70
- for t in texts:
71
- messages = [{"role": "user", "content": [{"type": "text", "text": t}]}]
72
- res_text = self.tokenizer.apply_chat_template(
73
- messages,
74
- add_generation_prompt=True,
75
- tokenize=False
76
- )
77
- formatted_prompts.append(res_text)
78
 
79
- # 2. Tokenize, truncate and pad in a single pass
80
- toks = self.tokenizer(
81
- formatted_prompts,
82
- padding="max_length",
83
- max_length=248,
84
- truncation=True, # truncate in case the input is longer
85
- return_tensors="pt"
86
- ).to(device)
87
 
88
- # 3. Run through the model
89
- outputs = self.text_encoder(
90
- input_ids=toks.input_ids,
91
- attention_mask=toks.attention_mask,
92
- output_hidden_states=True
93
- )
94
 
95
- layer_index = -2
96
- last_hidden = outputs.hidden_states[layer_index]
97
- seq_len = toks.attention_mask.sum(dim=1) - 1
98
- pooled = last_hidden[torch.arange(len(last_hidden)), seq_len.clamp(min=0)]
99
 
100
- # --- NEW LOGIC: CONCATENATION FOR CROSS-ATTENTION ---
101
- # 1. Expand the pooled vector into a sequence [B, 1, 1024]
102
- pooled_expanded = pooled.unsqueeze(1)
103
-
104
- # 2. Concatenate the token sequence and the pooled vector
105
- # !!! CHANGE HERE !!!: the pooled token goes FIRST
106
- # Now: [B, 1 + L, 1024]. The pooled vector becomes a token at the START.
107
- new_encoder_hidden_states = torch.cat([pooled_expanded, last_hidden], dim=1)
108
-
109
- # 3. Update the attention mask for the new token
110
- # Attention mask: [B, 1 + L]. Prepend a 1 at the START.
111
- # torch.ones((batch_size, 1), device=device) creates a [B, 1] mask of ones.
112
- new_attention_mask = torch.cat([torch.ones((last_hidden.shape[0], 1), device=device), toks.attention_mask], dim=1)
113
 
114
- return new_encoder_hidden_states, new_attention_mask
 
 
 
 
115
 
116
- pos_embeds, pos_mask = get_encode(prompt)
117
- neg_embeds, neg_mask = get_encode(negative_prompt)
 
118
 
119
- batch_size = pos_embeds.shape[0]
120
- if neg_embeds.shape[0] != batch_size:
121
- neg_embeds = neg_embeds.repeat(batch_size, 1, 1)
122
- neg_mask = neg_mask.repeat(batch_size, 1)
123
 
124
- text_embeddings = torch.cat([neg_embeds, pos_embeds], dim=0)
125
- final_mask = torch.cat([neg_mask, pos_mask], dim=0)
 
 
 
 
 
126
 
127
- return text_embeddings.to(dtype=dtype), final_mask.to(dtype=torch.int64)
128
 
129
  @torch.no_grad()
130
  def __call__(
131
  self,
132
- prompt: Union[str, List[str]],
133
- image: Optional[Union[Image.Image, List[Image.Image]]] = None,
134
- coef: float = 0.97, # strength (0.0 = original, 1.0 = pure noise)
135
  negative_prompt: Optional[Union[str, List[str]]] = None,
136
  height: int = 1024,
137
  width: int = 1024,
138
  num_inference_steps: int = 40,
@@ -141,7 +286,6 @@ class SdxsPipeline(DiffusionPipeline):
141
  seed: Optional[int] = None,
142
  output_type: str = "pil",
143
  return_dict: bool = True,
144
- refine_prompt: bool = False,
145
  **kwargs,
146
  ):
147
  device = self.device
@@ -149,115 +293,81 @@ class SdxsPipeline(DiffusionPipeline):
149
 
150
  if generator is None and seed is not None:
151
  generator = torch.Generator(device=device).manual_seed(seed)
 
 
152
 
153
- # ==================== REFINE PROMPT (INLINE) ====================
154
- if refine_prompt and prompt:
155
- sys_msg = (
156
- "You are a skilled text-to-image prompt engineer whose sole function is to transform the user's input into an aesthetically optimized, detailed, and visually descriptive three-sentence output. "
157
- "**The primary subject (e.g., 'girl', 'dog', 'house') MUST be the main focus of the revised prompt and MUST be described in rich detail within the first sentence or two.** "
158
- "Output **only** the final revised prompt, with absolutely no commentary.\n Don't use cliches like warm,soft,vibrant, wildflowers. Be creative. User input prompt: "
159
- )
160
- prompts_list = [prompt] if isinstance(prompt, str) else prompt
161
- refined_list = []
 
 
 
 
162
 
163
- for p in prompts_list:
164
- messages = [{"role": "user", "content": [{"type": "text", "text": sys_msg + p}]}]
165
-
166
- # Use the Qwen-Instruct format (apply_chat_template inserts the system/user/assistant tokens itself)
167
- inputs = self.tokenizer.apply_chat_template(
168
- messages,
169
- tokenize=True,
170
- add_generation_prompt=True,
171
- return_dict=True,
172
- return_tensors="pt"
173
- ).to(device)
174
-
175
- generated_ids = self.text_encoder.generate(
176
- **inputs, max_new_tokens=248, do_sample=True,temperature = 0.7
177
- )
178
-
179
- # Trim the input tokens from the response
180
- generated_ids_trimmed = [
181
- out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
182
- ]
183
- output_text = self.tokenizer.batch_decode(
184
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
185
- )
186
- refined_list.append(output_text)
187
-
188
- prompt = refined_list[0] if isinstance(prompt, str) else refined_list
189
 
190
- # ==================== ENCODE PROMPTS ====================
191
- text_embeddings, attention_mask = self.encode_prompt(
192
- prompt, negative_prompt, device, dtype
193
- )
194
- batch_size = 1 if isinstance(prompt, str) else len(prompt)
195
 
196
- # 2. Scheduler timesteps
197
  self.scheduler.set_timesteps(num_inference_steps, device=device)
198
  timesteps = self.scheduler.timesteps
199
-
200
- # ==================== IMG2IMG BLOCK (NEW VERSION) ====================
201
- if image is not None:
202
- # --- Prepare the image ---
203
- if isinstance(image, Image.Image):
204
- image_tensor = self.preprocess_image(image, width, height).to(device, self.vae.dtype)
205
- else:
206
- image_tensor = self.preprocess_image(image[0], width, height).to(device, self.vae.dtype)
207
-
208
- # --- Encode to latents ---
209
- latents_clean = self.vae.encode(image_tensor).latent_dist.sample(generator=generator)
210
- latents_clean = (latents_clean - self.vae_latents_mean.to(device, self.vae.dtype)) / self.vae_latents_std.to(device, self.vae.dtype)
211
- latents_clean = latents_clean.to(dtype)
212
-
213
- # --- Add noise via the rectified flow formula ---
214
- noise = torch.randn_like(latents_clean)
215
-
216
- # coef = strength (0.0 → original, 1.0 → pure noise)
217
- sigma = coef # in flow matching, sigma = t
218
- if hasattr(self.scheduler, "sigma_shift"): # if the scheduler has a shift (Flux-style)
219
- sigma = self.scheduler.sigma_shift(sigma)
220
-
221
- latents = (1.0 - sigma) * latents_clean + sigma * noise
222
-
223
- # Trim the timesteps starting from the current sigma
224
- init_timestep = int(num_inference_steps * coef)
225
- t_start = max(num_inference_steps - init_timestep, 0)
226
- timesteps = timesteps[t_start:]
227
-
228
  else:
229
- # txt2img
230
- latent_h = height // self.vae_scale_factor
231
- latent_w = width // self.vae_scale_factor
232
-
233
- latents = torch.randn(
234
- (batch_size, self.unet.config.in_channels, latent_h, latent_w),
235
- generator=generator, device=device, dtype=dtype
236
- )
237
 
238
- # ==================== DENOISING LOOP (одинаковый для txt2img и img2img) ====================
239
  for i, t in enumerate(tqdm(timesteps, desc="Sampling")):
240
- latent_model_input = torch.cat([latents] * 2) if guidance_scale > 1.0 else latents
 
241
 
242
  model_out = self.unet(
243
- latent_model_input,
244
- t,
245
  encoder_hidden_states=text_embeddings,
246
  encoder_attention_mask=attention_mask,
247
  return_dict=False,
248
  )[0]
249
 
250
- if guidance_scale > 1.0:
 
251
  flow_uncond, flow_cond = model_out.chunk(2)
252
  model_out = flow_uncond + guidance_scale * (flow_cond - flow_uncond)
253
 
254
- # Important: use scheduler.step; it knows how to handle the velocity
255
  latents = self.scheduler.step(model_out, t, latents, return_dict=False)[0]
256
 
257
- # ==================== DECODE ====================
258
  if output_type == "latent":
259
  if not return_dict: return (latents, prompt)
260
- return SdxsPipelineOutput(images=latents, prompt=prompt)
261
 
262
  latents = latents * self.vae_latents_std.to(device, self.vae.dtype) + self.vae_latents_mean.to(device, self.vae.dtype)
263
  image_output = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]
@@ -271,5 +381,5 @@ class SdxsPipeline(DiffusionPipeline):
271
  images = image_np
272
 
273
  if not return_dict:
274
- return (images, prompt)
275
- return SdxsPipelineOutput(images=images, prompt=prompt)
 
7
  from diffusers import DiffusionPipeline
8
  from diffusers.utils import BaseOutput
9
  from tqdm import tqdm
 
10
 
11
  @dataclass
12
  class SdxsPipelineOutput(BaseOutput):
 
14
  prompt: Optional[Union[str, List[str]]] = None
15
 
16
  class SdxsPipeline(DiffusionPipeline):
17
+ MAX_TEXT_TOKENS = 248
18
+
19
+ def __init__(self, vae, text_encoder, processor, tokenizer, unet, scheduler):
20
  super().__init__()
21
  self.register_modules(
22
  vae=vae,
23
  text_encoder=text_encoder,
24
+ processor=processor,
25
  tokenizer=tokenizer,
26
  unet=unet,
27
  scheduler=scheduler
 
32
  if mean is not None and std is not None:
33
  self.vae_latents_std = torch.tensor(std, device=self.unet.device, dtype=self.unet.dtype).view(1, len(std), 1, 1)
34
  self.vae_latents_mean = torch.tensor(mean, device=self.unet.device, dtype=self.unet.dtype).view(1, len(mean), 1, 1)
35
+
36
+ @staticmethod
37
+ def _pad_tensor_to_length(tensor: torch.Tensor, target_len: int, dim: int = 1, pad_value: float = 0) -> torch.Tensor:
38
+ current_len = tensor.shape[dim]
39
+ if current_len >= target_len:
40
+ return tensor
41
+ pad_size = target_len - current_len
42
+ if tensor.dim() == 3:
43
+ padding = (0, 0, 0, pad_size, 0, 0)
44
+ elif tensor.dim() == 2:
45
+ padding = (0, pad_size, 0, 0)
46
+ else:
47
+ raise ValueError(f"Unsupported tensor dimension: {tensor.dim()}")
48
+ return torch.nn.functional.pad(tensor, padding, value=pad_value)
49
+
+
+    @torch.no_grad()
+    def refine_prompts(
+        self,
+        prompts: Union[str, List[str]],
+        system_prompt: Optional[str] = None,
+        temperature: float = 0.7
+    ) -> List[str]:
+        """
+        Refines a list of prompts using the Text Encoder (LLM).
+
+        Args:
+            prompts: Single prompt string or list of prompts.
+            system_prompt: Custom instruction for the LLM. If None, uses default aesthetic enhancer.
+            temperature: Sampling temperature for generation.
+
+        Returns:
+            List of refined prompts.
+        """
+        device = self.device
+
+        # Default system prompt if none provided
+        if system_prompt is None:
+            system_prompt = (
+                "You are a skilled text-to-image prompt engineer whose sole function is to transform "
+                "the user's input into an aesthetically optimized, detailed, and visually descriptive three-sentence output. "
+                "**The primary subject (e.g., 'girl', 'dog', 'house') MUST be the main focus of the revised prompt "
+                "and MUST be described in rich detail within the first sentence or two.** "
+                "Output **only** the final revised prompt, with absolutely no commentary. "
+                "Don't use cliches like warm, soft, vibrant, wildflowers. Be creative. User input prompt: "
+            )
+
+        pad_id = getattr(self.text_encoder.config, "pad_token_id", None) or \
+                 getattr(self.text_encoder.config, "eos_token_id", None)
+
+        prompts_list = [prompts] if isinstance(prompts, str) else prompts
+        refined_list = []
+
+        for p in prompts_list:
+            # Prepend system prompt to user input
+            full_text = system_prompt + p
+            messages = [{"role": "user", "content": [{"type": "text", "text": full_text}]}]
+
+            inputs = self.tokenizer.apply_chat_template(
+                messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+            ).to(device)
+
+            generated_ids = self.text_encoder.generate(
+                **inputs,
+                max_new_tokens=self.MAX_TEXT_TOKENS,
+                do_sample=True,
+                temperature=temperature,
+                pad_token_id=pad_id
+            )
+
+            generated_ids_trimmed = [
+                out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+            ]
+            output_text = self.tokenizer.batch_decode(
+                generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+            )
+            refined_list.append(output_text[0])
+
+        return refined_list
+
+    @torch.no_grad()
+    def encode_text(self, text: Union[str, List[str]]) -> Tuple[torch.Tensor, torch.Tensor]:
+        device = self.device
+        dtype = self.unet.dtype
+        if text is None: text = ""
+        if isinstance(text, str): text = [text]
+
+        formatted_prompts = []
+        for t in text:
+            messages = [{"role": "user", "content": [{"type": "text", "text": t}]}]
+            formatted_prompts.append(self.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False))
+
+        toks = self.tokenizer(formatted_prompts, padding="max_length", max_length=self.MAX_TEXT_TOKENS, truncation=True, return_tensors="pt").to(device)
+        outputs = self.text_encoder(input_ids=toks.input_ids, attention_mask=toks.attention_mask, output_hidden_states=True)

+        last_hidden = outputs.hidden_states[-2]
+        seq_len = toks.attention_mask.sum(dim=1) - 1
+        pooled = last_hidden[torch.arange(len(last_hidden)), seq_len.clamp(min=0)]

+        pooled_expanded = pooled.unsqueeze(1)
+        encoder_hidden_states = torch.cat([pooled_expanded, last_hidden], dim=1)
+        attention_mask = torch.cat([torch.ones((last_hidden.shape[0], 1), device=device, dtype=toks.attention_mask.dtype), toks.attention_mask], dim=1)
+
+        return encoder_hidden_states.to(dtype=dtype), attention_mask.to(dtype=torch.int64)
+
+    @torch.no_grad()
+    def encode_image(self, image: Union[Image.Image, str, List[Union[Image.Image, str]]]) -> Tuple[torch.Tensor, torch.Tensor]:
+        device = self.device
+        dtype = self.unet.dtype
+        if isinstance(image, (str, Image.Image)): image = [image]
+        batch_size = len(image)

+        all_messages = [[{"role": "user", "content": [{"type": "image", "image": img}]}] for img in image]
+        formatted_prompts = [self.processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True) for msgs in all_messages]

+        inputs = self.processor(text=formatted_prompts, images=image, return_tensors="pt", padding=True, truncation=False).to(device)
+        outputs = self.text_encoder(**inputs, output_hidden_states=True)

+        last_hidden = outputs.hidden_states[-2]
+        seq_lens = inputs.attention_mask.sum(dim=1) - 1
+        pooled = last_hidden[torch.arange(batch_size), seq_lens.clamp(min=0)]

+        final_embeddings = torch.cat([pooled.unsqueeze(1), last_hidden], dim=1)
+        final_mask = torch.cat([torch.ones((batch_size, 1), device=device, dtype=inputs.attention_mask.dtype), inputs.attention_mask], dim=1)

+        return final_embeddings.to(dtype=dtype), final_mask.to(dtype=torch.int64)
+
+    @torch.no_grad()
+    def encode_text_and_image_naive(self, text: Union[str, List[str]], image: Optional[Union[Image.Image, List[Image.Image], str, List[str]]] = None, scale=0.5) -> Tuple[torch.Tensor, torch.Tensor]:
+        # 1. Get the text embeddings
+        text_embeds, text_mask = self.encode_text(text)
+
+        if image is not None:
+            if isinstance(image, (str, Image.Image)):
+                image = [image]
+
+            # If there is one image but several texts, replicate the image
+            if len(image) == 1 and text_embeds.shape[0] > 1:
+                image = image * text_embeds.shape[0]
+
+            # --- Begin inlined logic from encode_image ---
+            device = self.device
+            dtype = self.unet.dtype
+            batch_size = len(image)

+            all_messages = [[{"role": "user", "content": [{"type": "image", "image": img}]}] for img in image]
+            formatted_prompts = [self.processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True) for msgs in all_messages]

+            inputs = self.processor(text=formatted_prompts, images=image, return_tensors="pt", padding=True, truncation=False).to(device)
+            outputs = self.text_encoder(**inputs, output_hidden_states=True)

+            # Take the desired hidden state (layer -2)
+            img_hidden_states = outputs.hidden_states[-2]
+            # Take the attention mask from the processor
+            img_mask = inputs.attention_mask
+            # --- End inlined logic ---

+            # Apply scaling
+            if scale != 1.0:
+                img_hidden_states = img_hidden_states * scale

+            # Match the mask and dtypes to the text embeddings
+            img_mask = img_mask.to(text_mask.dtype)
+            img_hidden_states = img_hidden_states.to(dtype=dtype)

+            # Concatenate the text and image token sequences
+            final_embeds = torch.cat([text_embeds, img_hidden_states], dim=1)
+            final_mask = torch.cat([text_mask, img_mask], dim=1)
+
+            return final_embeds, final_mask
+
+        return text_embeds, text_mask

+    @torch.no_grad()
+    def image_upscale(
+        self,
+        image: Union[str, Image.Image, List[Union[str, Image.Image]]],
+        batch_size: int = 1
+    ) -> List[Image.Image]:
+        """
+        Upscales images using asymmetric VAE (x2).
+        Uses smart batching: processes in parallel if sizes match, else falls back to sequential.
+        """
+        images = [image] if isinstance(image, (str, Image.Image)) else image
+
+        # 1. Preprocess: load, handle alpha, pad to a multiple of 8, normalize
+        batch_data = []
+        for img in images:
+            if isinstance(img, str): img = Image.open(img)
+            if img.mode == "RGBA":
+                img = Image.alpha_composite(Image.new("RGBA", img.size, (255, 255, 255)), img)
+            img = img.convert("RGB")
+
+            w, h = img.size
+            pw, ph = (8 - w % 8) % 8, (8 - h % 8) % 8
+            if pw or ph:
+                padded = Image.new("RGB", (w + pw, h + ph), (255, 255, 255))
+                padded.paste(img)
+                img = padded
+
+            t = torch.from_numpy(np.array(img).astype(np.float32) / 127.5 - 1.0).permute(2, 0, 1)
+            batch_data.append((t.to(self.device, torch.float16), w, h))

+        # 2. Determine execution strategy:
+        #    if all shapes are identical, use batch_size, else fall back to 1.
+        unique_shapes = {t.shape for t, _, _ in batch_data}
+        step = batch_size if len(unique_shapes) == 1 else 1
+
+        output_images = []
+
+        # 3. Process batches
+        for i in range(0, len(batch_data), step):
+            chunk = batch_data[i : i + step]
+
+            # Stack tensors [B, C, H, W]
+            tensors = torch.stack([c[0] for c in chunk])
+
+            # Encode -> decode (using the mean for a deterministic upscale)
+            latents = self.vae.encode(tensors).latent_dist.mean
+            latents = latents * self.vae_latents_std.to(latents) + self.vae_latents_mean.to(latents)
+            decoded = self.vae.decode(latents.to(self.vae.dtype))[0]
+
+            # 4. Post-process: denormalize and crop
+            decoded = (decoded.clamp(-1, 1) + 1) / 2
+            for j, tensor in enumerate(decoded):
+                w, h = chunk[j][1], chunk[j][2]  # original sizes
+
+                # Crop to exactly 2x the original size
+                arr = tensor.cpu().permute(1, 2, 0).float().numpy()
+                arr = arr[:h * 2, :w * 2]
+
+                output_images.append(Image.fromarray((arr * 255).astype("uint8")))
+
+        return output_images
+
     @torch.no_grad()
     def __call__(
         self,
+        prompt: Optional[Union[str, List[str]]] = None,
         negative_prompt: Optional[Union[str, List[str]]] = None,
+        prompt_embeds: Optional[torch.Tensor] = None,
+        negative_prompt_embeds: Optional[torch.Tensor] = None,
+        prompt_attention_mask: Optional[torch.Tensor] = None,
+        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
+        latents: Optional[torch.Tensor] = None,
         height: int = 1024,
         width: int = 1024,
         num_inference_steps: int = 40,

         seed: Optional[int] = None,
         output_type: str = "pil",
         return_dict: bool = True,
         **kwargs,
     ):
         device = self.device

         if generator is None and seed is not None:
             generator = torch.Generator(device=device).manual_seed(seed)
+
+        do_classifier_free_guidance = guidance_scale > 1.0

+        # 1. Encode positive prompt
+        if prompt_embeds is None:
+            if prompt is None: raise ValueError("`prompt` or `prompt_embeds` required.")
+            prompt_embeds, prompt_attention_mask = self.encode_text(prompt)
+        prompt_embeds = prompt_embeds.to(device=device, dtype=dtype)
+        prompt_attention_mask = prompt_attention_mask.to(device=device, dtype=torch.int64)
+        batch_size = prompt_embeds.shape[0]
+
+        # 2. Encode negative prompt (only if CFG is enabled)
+        if do_classifier_free_guidance:
+            if negative_prompt_embeds is None:
+                neg_text = negative_prompt if negative_prompt is not None else ("" if isinstance(prompt, str) else [""] * len(prompt))
+                negative_prompt_embeds, negative_prompt_attention_mask = self.encode_text(neg_text)

+            negative_prompt_embeds = negative_prompt_embeds.to(device=device, dtype=dtype)
+            negative_prompt_attention_mask = negative_prompt_attention_mask.to(device=device, dtype=torch.int64)

+            # Batch size matching
+            if negative_prompt_embeds.shape[0] != batch_size:
+                negative_prompt_embeds = negative_prompt_embeds.repeat(batch_size, 1, 1)
+                negative_prompt_attention_mask = negative_prompt_attention_mask.repeat(batch_size, 1)
+
+            # 3. Align sequence lengths (padding) before concatenation
+            max_len = max(prompt_embeds.shape[1], negative_prompt_embeds.shape[1])
+            prompt_embeds = self._pad_tensor_to_length(prompt_embeds, max_len, dim=1, pad_value=0)
+            negative_prompt_embeds = self._pad_tensor_to_length(negative_prompt_embeds, max_len, dim=1, pad_value=0)
+            prompt_attention_mask = self._pad_tensor_to_length(prompt_attention_mask, max_len, dim=1, pad_value=0)
+            negative_prompt_attention_mask = self._pad_tensor_to_length(negative_prompt_attention_mask, max_len, dim=1, pad_value=0)
+
+            # 4. Concatenate for CFG: [neg, pos]
+            text_embeddings = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
+            attention_mask = torch.cat([negative_prompt_attention_mask, prompt_attention_mask], dim=0)
+        else:
+            # Without CFG, use the positive embeddings as-is
+            text_embeddings = prompt_embeds
+            attention_mask = prompt_attention_mask

+        # 5. Scheduler & latents
         self.scheduler.set_timesteps(num_inference_steps, device=device)
         timesteps = self.scheduler.timesteps
+
+        latent_h = height // self.vae_scale_factor
+        latent_w = width // self.vae_scale_factor
+
+        if latents is None:
+            latents = torch.randn((batch_size, self.unet.config.in_channels, latent_h, latent_w), generator=generator, device=device, dtype=dtype)
         else:
+            latents = latents.to(device=device, dtype=dtype)

+        # 6. Denoising loop
         for i, t in enumerate(tqdm(timesteps, desc="Sampling")):
+            # Duplicate latents only when doing CFG
+            latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents

             model_out = self.unet(
+                latent_model_input, t,
                 encoder_hidden_states=text_embeddings,
                 encoder_attention_mask=attention_mask,
                 return_dict=False,
             )[0]

+            # Perform CFG guidance
+            if do_classifier_free_guidance:
                 flow_uncond, flow_cond = model_out.chunk(2)
                 model_out = flow_uncond + guidance_scale * (flow_cond - flow_uncond)

             latents = self.scheduler.step(model_out, t, latents, return_dict=False)[0]

+        # 7. Decode
         if output_type == "latent":
             if not return_dict: return (latents, prompt)
+            return SdxsPipelineOutput(images=latents)

         latents = latents * self.vae_latents_std.to(device, self.vae.dtype) + self.vae_latents_mean.to(device, self.vae.dtype)
         image_output = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]

         images = image_np

         if not return_dict:
+            return (images,)
+        return SdxsPipelineOutput(images=images)
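
The `image_upscale` preprocessing above pads each side up to the next multiple of 8 before VAE encoding, then crops back to exactly 2x after decoding. A minimal sketch of that padding arithmetic (the helper name is ours, not part of the pipeline):

```python
def pad_to_multiple_of_8(w: int, h: int) -> tuple:
    # (8 - x % 8) % 8 adds just enough pixels to reach the next
    # multiple of 8, and adds nothing when x is already divisible by 8.
    pw, ph = (8 - w % 8) % 8, (8 - h % 8) % 8
    return w + pw, h + ph

print(pad_to_multiple_of_8(703, 705))  # → (704, 712)
```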
processor/chat_template.jinja ADDED
@@ -0,0 +1,154 @@
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- macro render_content(content, do_vision_count, is_system_content=false) %}
+{%- if content is string %}
+{{- content }}
+{%- elif content is iterable and content is not mapping %}
+{%- for item in content %}
+{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
+{%- if is_system_content %}
+{{- raise_exception('System message cannot contain images.') }}
+{%- endif %}
+{%- if do_vision_count %}
+{%- set image_count.value = image_count.value + 1 %}
+{%- endif %}
+{%- if add_vision_id %}
+{{- 'Picture ' ~ image_count.value ~ ': ' }}
+{%- endif %}
+{{- '<|vision_start|><|image_pad|><|vision_end|>' }}
+{%- elif 'video' in item or item.type == 'video' %}
+{%- if is_system_content %}
+{{- raise_exception('System message cannot contain videos.') }}
+{%- endif %}
+{%- if do_vision_count %}
+{%- set video_count.value = video_count.value + 1 %}
+{%- endif %}
+{%- if add_vision_id %}
+{{- 'Video ' ~ video_count.value ~ ': ' }}
+{%- endif %}
+{{- '<|vision_start|><|video_pad|><|vision_end|>' }}
+{%- elif 'text' in item %}
+{{- item.text }}
+{%- else %}
+{{- raise_exception('Unexpected item type in content.') }}
+{%- endif %}
+{%- endfor %}
+{%- elif content is none or content is undefined %}
+{{- '' }}
+{%- else %}
+{{- raise_exception('Unexpected content type.') }}
+{%- endif %}
+{%- endmacro %}
+{%- if not messages %}
+{{- raise_exception('No messages provided.') }}
+{%- endif %}
+{%- if tools and tools is iterable and tools is not mapping %}
+{{- '<|im_start|>system\n' }}
+{{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
+{%- for tool in tools %}
+{{- "\n" }}
+{{- tool | tojson }}
+{%- endfor %}
+{{- "\n</tools>" }}
+{{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+{%- if messages[0].role == 'system' %}
+{%- set content = render_content(messages[0].content, false, true)|trim %}
+{%- if content %}
+{{- '\n\n' + content }}
+{%- endif %}
+{%- endif %}
+{{- '<|im_end|>\n' }}
+{%- else %}
+{%- if messages[0].role == 'system' %}
+{%- set content = render_content(messages[0].content, false, true)|trim %}
+{{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
+{%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+{%- set index = (messages|length - 1) - loop.index0 %}
+{%- if ns.multi_step_tool and message.role == "user" %}
+{%- set content = render_content(message.content, false)|trim %}
+{%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
+{%- set ns.multi_step_tool = false %}
+{%- set ns.last_query_index = index %}
+{%- endif %}
+{%- endif %}
+{%- endfor %}
+{%- if ns.multi_step_tool %}
+{{- raise_exception('No user query found in messages.') }}
+{%- endif %}
+{%- for message in messages %}
+{%- set content = render_content(message.content, true)|trim %}
+{%- if message.role == "system" %}
+{%- if not loop.first %}
+{{- raise_exception('System message must be at the beginning.') }}
+{%- endif %}
+{%- elif message.role == "user" %}
+{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+{%- elif message.role == "assistant" %}
+{%- set reasoning_content = '' %}
+{%- if message.reasoning_content is string %}
+{%- set reasoning_content = message.reasoning_content %}
+{%- else %}
+{%- if '</think>' in content %}
+{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+{%- set content = content.split('</think>')[-1].lstrip('\n') %}
+{%- endif %}
+{%- endif %}
+{%- set reasoning_content = reasoning_content|trim %}
+{%- if loop.index0 > ns.last_query_index %}
+{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
+{%- else %}
+{{- '<|im_start|>' + message.role + '\n' + content }}
+{%- endif %}
+{%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
+{%- for tool_call in message.tool_calls %}
+{%- if tool_call.function is defined %}
+{%- set tool_call = tool_call.function %}
+{%- endif %}
+{%- if loop.first %}
+{%- if content|trim %}
+{{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+{%- else %}
+{{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
+{%- endif %}
+{%- else %}
+{{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+{%- endif %}
+{%- if tool_call.arguments is defined %}
+{%- for args_name, args_value in tool_call.arguments|items %}
+{{- '<parameter=' + args_name + '>\n' }}
+{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+{{- args_value }}
+{{- '\n</parameter>\n' }}
+{%- endfor %}
+{%- endif %}
+{{- '</function>\n</tool_call>' }}
+{%- endfor %}
+{%- endif %}
+{{- '<|im_end|>\n' }}
+{%- elif message.role == "tool" %}
+{%- if loop.previtem and loop.previtem.role != "tool" %}
+{{- '<|im_start|>user' }}
+{%- endif %}
+{{- '\n<tool_response>\n' }}
+{{- content }}
+{{- '\n</tool_response>' }}
+{%- if not loop.last and loop.nextitem.role != "tool" %}
+{{- '<|im_end|>\n' }}
+{%- elif loop.last %}
+{{- '<|im_end|>\n' }}
+{%- endif %}
+{%- else %}
+{{- raise_exception('Unexpected message role.') }}
+{%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+{{- '<|im_start|>assistant\n' }}
+{%- if enable_thinking is defined and enable_thinking is true %}
+{{- '<think>\n' }}
+{%- else %}
+{{- '<think>\n\n</think>\n\n' }}
+{%- endif %}
+{%- endif %}
processor/processor_config.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:14932921ca485d458a04dafd8069fbb0a4505622a48208d19ed247115801385b
+size 1300
processor/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:87a7830d63fcf43bf241c3c5242e96e62dd3fdc29224ca26fed8ea333db72de4
+size 19989343
processor/tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e98f1901ac6f0adff67b1d540bfa0c36ac1a0cf59eb72ed78146ef89aafa1182
+size 1139
samples/unet_384x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 759eb0dd6eee67ad062e24cd39511d24498e90c765d555bc0d051231790d2612
  • Pointer size: 131 Bytes
  • Size of remote file: 480 kB

Git LFS Details

  • SHA256: 62e5e25dd1f3bbe5369d14da45241bd4b8a00852b33b452e3c2f9574c9073d55
  • Pointer size: 131 Bytes
  • Size of remote file: 253 kB
samples/unet_416x704_0.jpg CHANGED

Git LFS Details

  • SHA256: cd420618126075ebfdd645b9ecc354e9a87eaffd86d677a84234b166669fa998
  • Pointer size: 131 Bytes
  • Size of remote file: 383 kB

Git LFS Details

  • SHA256: f3a8b1425b4857fb890e4879cbf3427aa4d8e60ecd3a1a78033ef17aec8c4347
  • Pointer size: 131 Bytes
  • Size of remote file: 322 kB
samples/unet_448x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 545d587255b90cc6e580d4baa91a0edd1b75f90387dc9b261524e0d23d2ab840
  • Pointer size: 131 Bytes
  • Size of remote file: 335 kB

Git LFS Details

  • SHA256: f853ab52729edebbd5e858743bb2ae5429b91721d105643ebcf6a18467a58515
  • Pointer size: 131 Bytes
  • Size of remote file: 465 kB
samples/unet_480x704_0.jpg CHANGED

Git LFS Details

  • SHA256: fe203ec5c6e73e7dde25ba4ab4acc231cf0fb81591b37efa2a01e14c35014460
  • Pointer size: 131 Bytes
  • Size of remote file: 307 kB

Git LFS Details

  • SHA256: c18f53822c498a3c1a7d2cc71a1428490950fad5355847baf8985f3f6d6dd8a4
  • Pointer size: 131 Bytes
  • Size of remote file: 369 kB
samples/unet_512x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 05de9327a514a0117d8e40488a30e0c29692bd970fae214729d4542ae7772eef
  • Pointer size: 131 Bytes
  • Size of remote file: 417 kB

Git LFS Details

  • SHA256: 8480233d5ceeb9701f0c041189ee93cbcba1a91b21befe6c3a9bfd349597d2b0
  • Pointer size: 131 Bytes
  • Size of remote file: 468 kB
samples/unet_544x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 839170e8bf0ffe85f70220e74e136c850a139844471acbb47ed376f23f5794a2
  • Pointer size: 131 Bytes
  • Size of remote file: 338 kB

Git LFS Details

  • SHA256: 8b2d9f7a4e4b37d394c68a9ebe016f3d18c250fde2618ebc35ff015b81cf0892
  • Pointer size: 131 Bytes
  • Size of remote file: 336 kB
samples/unet_576x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 86eee23cf19ed84e8e89074c34f3a76a49db51f44c2f2548d52143560e416f76
  • Pointer size: 131 Bytes
  • Size of remote file: 409 kB

Git LFS Details

  • SHA256: de45c9182cdd06d8af8470262203d3fa4a7d9533ee8e2e73e99102072b967352
  • Pointer size: 131 Bytes
  • Size of remote file: 561 kB
samples/unet_608x704_0.jpg CHANGED

Git LFS Details

  • SHA256: c06fd992f4f896caab882dd82120205ce80bac7586650c82a0b68f7a7800dd77
  • Pointer size: 131 Bytes
  • Size of remote file: 800 kB

Git LFS Details

  • SHA256: 8f834749ea32033fb639f5c98437a5a55d0eb5bdf2f4747dde47e46c8fa75ab5
  • Pointer size: 131 Bytes
  • Size of remote file: 767 kB
samples/unet_640x704_0.jpg CHANGED

Git LFS Details

  • SHA256: dd5eab7fbe0c4be16c93db31e28574d19351d9ccd16ed4e387b598899473f645
  • Pointer size: 131 Bytes
  • Size of remote file: 647 kB

Git LFS Details

  • SHA256: 95417d4b52af113a45acde9cd81554e3bab496bc50ef11fd5555ca183c1559bf
  • Pointer size: 131 Bytes
  • Size of remote file: 554 kB
samples/unet_672x704_0.jpg CHANGED

Git LFS Details

  • SHA256: f7e1797579d0782b5230120add98cd1271593e63789bf3733aa8f09935d8e535
  • Pointer size: 131 Bytes
  • Size of remote file: 229 kB

Git LFS Details

  • SHA256: 23218ba170a001403cce9cfd950fac911a8ecb283acc50d9a0d863e68c331186
  • Pointer size: 131 Bytes
  • Size of remote file: 340 kB
samples/unet_704x384_0.jpg CHANGED

Git LFS Details

  • SHA256: c0e67b88e5436ed81c51bc1c446e15dad7ba7ae71213fc749d93249c94dc9cac
  • Pointer size: 131 Bytes
  • Size of remote file: 451 kB

Git LFS Details

  • SHA256: 587d91d13f7e0761379fa8ba84e7fb34cb398885626f1cfda759c8324776680b
  • Pointer size: 131 Bytes
  • Size of remote file: 239 kB
samples/unet_704x416_0.jpg CHANGED

Git LFS Details

  • SHA256: ac8f291ae9c5cab96d4beea4581e87fbf8da2c608a6ac3b0210280b0445e6770
  • Pointer size: 131 Bytes
  • Size of remote file: 233 kB

Git LFS Details

  • SHA256: e43513c302b52131886438ba790044a2aeeb9afdfe5254a80389c5a3a165bcaf
  • Pointer size: 131 Bytes
  • Size of remote file: 229 kB
samples/unet_704x448_0.jpg CHANGED

Git LFS Details

  • SHA256: 955d1de974df209495430e6e1822a4d741983a6575dbb7b6e498465a4c8c4a3e
  • Pointer size: 131 Bytes
  • Size of remote file: 208 kB

Git LFS Details

  • SHA256: 86e8bbe945a0b148f7d7508e1a093e8ad95cc7ed440cc724da526f5472a1af79
  • Pointer size: 131 Bytes
  • Size of remote file: 369 kB
samples/unet_704x480_0.jpg CHANGED

Git LFS Details

  • SHA256: 0c492b7890843baaf0bc0555b37982b0f728c99ad2f91b992b8c394d5c9c2ff4
  • Pointer size: 131 Bytes
  • Size of remote file: 346 kB

Git LFS Details

  • SHA256: 615eeff210b8971ffc6b18f4a84193b543056f77ba9b10f70cd1fb6f9e2ca574
  • Pointer size: 131 Bytes
  • Size of remote file: 293 kB
samples/unet_704x512_0.jpg CHANGED

Git LFS Details

  • SHA256: 42df9462770f2a75d17682def582478c0a44f1fe5cb1c945089fc21c046f1ea8
  • Pointer size: 131 Bytes
  • Size of remote file: 473 kB

Git LFS Details

  • SHA256: ff5aff046047598406af9a8a55857457b2c5e0a59e3426453c636157edc75f63
  • Pointer size: 131 Bytes
  • Size of remote file: 261 kB
samples/unet_704x544_0.jpg CHANGED

Git LFS Details

  • SHA256: b904cb5b5a3e3dc1d094db9cd18a2f4a1e9aa10de50462e7c07c029f1399cbcc
  • Pointer size: 131 Bytes
  • Size of remote file: 197 kB

Git LFS Details

  • SHA256: 943a8babb233d5bca7c76664cf9d9b9a75c1e3d240157c565a80ad337de95624
  • Pointer size: 131 Bytes
  • Size of remote file: 351 kB
samples/unet_704x576_0.jpg CHANGED

Git LFS Details

  • SHA256: edace3a43408b1f3d0d35dfa4d189fe72bcb3d8706e06226125b64124c810cef
  • Pointer size: 131 Bytes
  • Size of remote file: 455 kB

Git LFS Details

  • SHA256: c2782beac2d5d0aa02b089980159d5b588070c30d5de2093f47c5353b70ade3e
  • Pointer size: 131 Bytes
  • Size of remote file: 505 kB
samples/unet_704x608_0.jpg CHANGED

Git LFS Details

  • SHA256: 87b573f47d0316643de6b2cf15bad4bea5cb3c1e0a15319d7250e514ef0ce34c
  • Pointer size: 131 Bytes
  • Size of remote file: 377 kB

Git LFS Details

  • SHA256: b4ca3d1e1767a81e369e8d05b83f040eb8141bc2d2d59432f314e906cc0d6a72
  • Pointer size: 131 Bytes
  • Size of remote file: 450 kB
samples/unet_704x640_0.jpg CHANGED

Git LFS Details

  • SHA256: 95bfe90417f6660fc9a610e334365a23311d1086b5f862f1f25c74f2ceb9ae98
  • Pointer size: 131 Bytes
  • Size of remote file: 348 kB

Git LFS Details

  • SHA256: be8a1e4e0040438ccf15e6686739d9a9f441c9665b286096a64d49b8bf93ce4e
  • Pointer size: 131 Bytes
  • Size of remote file: 243 kB
samples/unet_704x672_0.jpg CHANGED

Git LFS Details

  • SHA256: d44747d9162bcece687bec005622f43b3b457e065b8b136728487d39e63db440
  • Pointer size: 131 Bytes
  • Size of remote file: 787 kB

Git LFS Details

  • SHA256: ed33a29dfa17e16a1a6123fc7244762c993de55608d6f4e37b0951659193cdf3
  • Pointer size: 131 Bytes
  • Size of remote file: 419 kB
samples/unet_704x704_0.jpg CHANGED

Git LFS Details

  • SHA256: 3bc6f506a521d092de01061966a3da1833eb658325d02c5c9bc5193a7689e5f9
  • Pointer size: 131 Bytes
  • Size of remote file: 899 kB

Git LFS Details

  • SHA256: 472f745d5230e5e23eb0f818a6d6f0fb52538fa31e66cc77fc370087d74b0279
  • Pointer size: 131 Bytes
  • Size of remote file: 489 kB
src/unet1.5b.ipynb CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0e8e3028e9acfe5c8bf1cf2cb3a371eb91405c8080a120510172768bd86009ba
+oid sha256:bc106e53f10b9fd143231839045c4fab5413c64a4d3f096304d1c689682299a8
 size 45191
test.ipynb CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ba1d74b4649631d547cda5b137cb609e39ee1359d7025b2fb2a68e19424f041a
-size 9035778
+oid sha256:c290bd0cd5bed79850d835c7cbd8c556ef02fc9cbf01cdf8f75229db879fe710
+size 13053363
train.py CHANGED
@@ -35,13 +35,13 @@ ds_path = "datasets/ds1234_noanime_704_vae8x16x"
 project = "unet"

 gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
-local_bs = max(1, int((gpu_mem_gb / 32) * 8))
+local_bs = max(1, int((gpu_mem_gb / 32) * 7))
 num_gpus = torch.cuda.device_count()
 batch_size = local_bs * num_gpus

 base_learning_rate = 4e-5
 min_learning_rate = 4e-6
-learning_rate_scale = 5 # 5 - finetune (small details), 1 - pretrain
+learning_rate_scale = 1 # 5 - finetune (small details), 1 - pretrain
 base_learning_rate = base_learning_rate / learning_rate_scale
 min_learning_rate = min_learning_rate / learning_rate_scale
 print(f"Calculated params max-lr:{base_learning_rate} min-lr:{min_learning_rate} GPUs: {num_gpus}, Global BS: {batch_size}")
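
The train.py change above lowers the VRAM heuristic from 8 to 7 samples per 32 GB and switches `learning_rate_scale` back to the pretraining value. A standalone sketch of the same arithmetic, with GPU memory and count passed in rather than queried from CUDA (the function name is ours, for illustration only):

```python
def training_batch_config(gpu_mem_gb: float, num_gpus: int, learning_rate_scale: float = 1):
    # ~7 samples per 32 GB of VRAM, at least 1 per GPU
    local_bs = max(1, int((gpu_mem_gb / 32) * 7))
    batch_size = local_bs * num_gpus
    # learning_rate_scale: 5 for finetuning (small details), 1 for pretraining
    base_lr = 4e-5 / learning_rate_scale
    min_lr = 4e-6 / learning_rate_scale
    return local_bs, batch_size, base_lr, min_lr

print(training_batch_config(80.0, 4)[:2])  # → (17, 68)
```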
unet/config.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0a85ea1867dbee11485b2de5f5777cf16f5c5a2ed261dba0a465f5c649092299
+oid sha256:fbb20721f35fd23f45183d6c2341c319ac059296734c98c786b278c7a42e2f50
 size 1879
unet/diffusion_pytorch_model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb299eef5f0c3e0e3ed02691466c474ef240eb5059a2742a6ca94c8c744234f8
-size 3147092928
+oid sha256:35324c7f8ccdc476548954c82f76f6a38528201b7a514094dab2e8810519f47e
+size 6420443856