WAI REALCN (SDXL)
A photorealistic Stable Diffusion XL checkpoint released by the community as “WAI REALCN” and shared on Civitai. The model keeps the standard SDXL architecture: two CLIP text encoders, a latent UNet, and a VAE.
Model Summary
- Task: text-to-image generation at 1024×1024 (and downscaled resolutions).
- Architecture: SDXL with two CLIP text encoders (`CLIPTextModel` + `CLIPTextModelWithProjection`), a cross-attention UNet, and an AutoencoderKL VAE (scaling factor 0.13025).
- Scheduler: EulerDiscreteScheduler by default; other SDXL schedulers from `diffusers` also work (a swap example follows the Quickstart).
- Format: Diffusers pipeline (`StableDiffusionXLPipeline`); FP16 weights are expected at load time for GPU inference.
Recommended Use
- Photorealistic portraits and lifestyle imagery; neutral prompting works best (avoid over-stylized prompts).
- Works with standard SDXL negative prompting (e.g., “blurry, low quality, artifacts, extra limbs”).
- 1024×1024 is the native resolution; smaller sizes are fine, while larger outputs are best produced by upscaling a 1024 px generation.
Quickstart (Diffusers)
```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the checkpoint in FP16 and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "telcom/deewaiREALCN",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a candid street portrait of a young adult, soft daylight, shallow depth of field, high detail"
negative_prompt = "blurry, low quality, extra fingers, distorted face"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```
Files and Architecture Notes
- `model_index.json`: declares `StableDiffusionXLPipeline` with dual tokenizers/encoders (standard SDXL design).
- `tokenizer/` & `tokenizer_2/`: separate CLIP tokenizers matching the two text encoders; keep both to preserve padding/special-token behavior.
- `text_encoder/`: 12-layer CLIP text encoder (768 hidden size, quick GELU).
- `text_encoder_2/`: 32-layer CLIP text encoder with projection (1280 hidden size, GELU).
- `unet/`: latent UNet with cross-attention (`sample_size: 128` → 1024 px images).
- `vae/`: AutoencoderKL with `scaling_factor: 0.13025` for latents.
- `scheduler/`: default Euler scheduler settings.
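These notes can be checked against a loaded pipeline. A small sketch (assuming `pipe` from the Quickstart) that prints the component classes and the config fields mentioned above:

```python
# Component classes and the key config values referenced in the list above.
print(type(pipe.text_encoder).__name__, pipe.text_encoder.config.hidden_size)      # CLIPTextModel 768
print(type(pipe.text_encoder_2).__name__, pipe.text_encoder_2.config.hidden_size)  # CLIPTextModelWithProjection 1280
print(type(pipe.unet).__name__, pipe.unet.config.sample_size)                      # UNet2DConditionModel 128
print(type(pipe.vae).__name__, pipe.vae.config.scaling_factor)                     # AutoencoderKL 0.13025
print(type(pipe.scheduler).__name__)                                               # EulerDiscreteScheduler
```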
Prompting Tips
- Start concise: subject + setting + lighting + camera feel (e.g., “portrait, indoor window light, 85mm, f/1.8”).
- Add quality anchors sparingly (“high detail”, “natural skin”, “cinematic lighting”).
- Keep negatives short; overlong negatives can reduce fidelity.
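Putting those tips together, a hedged example of a prompt built as subject + setting + lighting + camera feel with a short negative prompt (the prompt text itself is arbitrary; `pipe` comes from the Quickstart):

```python
prompt = (
    "portrait of an elderly fisherman, harbor at dawn, "   # subject + setting
    "soft directional light, 85mm, f/1.8, "                # lighting + camera feel
    "natural skin, high detail"                            # sparing quality anchors
)
negative_prompt = "blurry, low quality, distorted face"    # short negatives

image = pipe(prompt, negative_prompt=negative_prompt,
             num_inference_steps=30, guidance_scale=7.5).images[0]
```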
Safety and Limitations
- May reproduce biases or create sensitive/NSFW content; review outputs before use.
- Not guaranteed for medical, legal, or safety-critical applications.
- Respect the CreativeML Open RAIL-M license; comply with downstream use restrictions.
More about the model’s structure
1) Overall pipeline layout
It’s a text-to-image diffusion pipeline with the usual components:
- tokenizer + text_encoder: turns your prompt into embeddings
- tokenizer_2 + text_encoder_2: a second, larger text encoder (so it’s dual-encoder)
- UNet2DConditionModel: the main denoiser that predicts noise at each diffusion step
- VAE (AutoencoderKL): converts images ↔ latent space
- EulerDiscreteScheduler: controls the denoising step schedule (Euler sampler)
This structure is typical of SDXL-like pipelines (two CLIP text encoders, big UNet, VAE, Euler scheduler).
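To make the wiring concrete, here is an illustrative manual generation loop that calls each of those components directly. It is a sketch of what the pipeline does internally, not the library’s own implementation, and it assumes `pipe` was loaded as in the Quickstart:

```python
import torch

torch.set_grad_enabled(False)  # inference only
device, dtype = pipe.device, pipe.unet.dtype
height = width = 1024

# 1) Prompt -> per-token embeddings (both encoders, concatenated) + pooled embedding.
prompt_embeds, neg_embeds, pooled, neg_pooled = pipe.encode_prompt(
    prompt="a candid street portrait, soft daylight",
    negative_prompt="blurry, low quality",
    device=device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

# 2) SDXL's extra size/crop conditioning: (orig_h, orig_w, crop_top, crop_left, target_h, target_w).
add_time_ids = torch.tensor([[height, width, 0, 0, height, width]], device=device, dtype=dtype)

# 3) Random starting latents: 4 channels at 1/8 of the pixel resolution.
latents = torch.randn(1, pipe.unet.config.in_channels, height // 8, width // 8,
                      device=device, dtype=dtype)
pipe.scheduler.set_timesteps(30, device=device)
latents = latents * pipe.scheduler.init_noise_sigma

# 4) Denoising loop with classifier-free guidance (scale 7.5).
for t in pipe.scheduler.timesteps:
    latent_in = pipe.scheduler.scale_model_input(torch.cat([latents] * 2), t)
    noise_pred = pipe.unet(
        latent_in, t,
        encoder_hidden_states=torch.cat([neg_embeds, prompt_embeds]),
        added_cond_kwargs={"text_embeds": torch.cat([neg_pooled, pooled]),
                           "time_ids": torch.cat([add_time_ids] * 2)},
    ).sample
    uncond, cond = noise_pred.chunk(2)
    noise_pred = uncond + 7.5 * (cond - uncond)
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# 5) Decode with the VAE (upcast to fp32 for stability, cf. force_upcast) and convert to PIL.
decoded = pipe.vae.to(torch.float32).decode(latents.float() / pipe.vae.config.scaling_factor).sample
image = pipe.image_processor.postprocess(decoded, output_type="pil")[0]
```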
2) Two text encoders (why and what “projection” means)
You have:
- Text Encoder (12 layers, 768-d, ~123M params): this matches the size of a CLIP ViT-L/14-style text tower.
- Text Encoder 2 (32 layers, 1280-d, ~695M params), with projection: “WithProjection” means the encoder also applies a learned projection head on top of its pooled output, mapping it to a target dimension (used in SDXL so the conditioning streams can be combined cleanly).
Net effect: every prompt is encoded twice, yielding both per-token hidden states and a pooled embedding, which gives the UNet richer conditioning.
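A quick way to see both conditioning streams and their widths (assuming `pipe` from the Quickstart; the prompt string is arbitrary):

```python
import torch

with torch.no_grad():
    ids_1 = pipe.tokenizer("a test prompt", padding="max_length",
                           max_length=pipe.tokenizer.model_max_length,
                           return_tensors="pt").input_ids.to(pipe.device)
    ids_2 = pipe.tokenizer_2("a test prompt", padding="max_length",
                             max_length=pipe.tokenizer_2.model_max_length,
                             return_tensors="pt").input_ids.to(pipe.device)

    out_1 = pipe.text_encoder(ids_1, output_hidden_states=True)
    out_2 = pipe.text_encoder_2(ids_2, output_hidden_states=True)

    # Per-token hidden states: 768-d and 1280-d; concatenated they match the
    # 2048-d conditioning width the UNet expects (see the UNet config below).
    h1, h2 = out_1.hidden_states[-2], out_2.hidden_states[-2]
    print(h1.shape, h2.shape, torch.cat([h1, h2], dim=-1).shape)

    # Pooled, projected embedding from the second encoder ("WithProjection").
    print(out_2.text_embeds.shape)  # [1, 1280]
```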
3) UNet config tells you image resolution and how heavy the model is
Key fields:
- sample_size=128 (latents are 4 × 128 × 128): since SD VAEs downsample by 8×, this corresponds to roughly 1024×1024 output resolution (128 × 8 = 1024).
- cross_attention_dim=2048: the UNet expects conditioning vectors of width 2048. With two encoders (768 and 1280), the per-token embeddings concatenate to exactly 2048 (768 + 1280), a strong hint that this is SDXL-style dual conditioning.
- block_out_channels=[320, 640, 1280], attention heads [5, 10, 20]: the channel widths per stage and how attention capacity increases deeper in the network.
- transformer_layers_per_block=[1, 2, 10]: the deepest blocks stack ten transformer layers each, which is a large part of why the model is so heavy.
- params=2567.46M: the UNet alone is about 2.57B parameters, matching the size of the SDXL base UNet.
Also:
- addition_embed_type=text_time: the UNet takes extra conditioning beyond cross-attention, combining the pooled text embedding with the size/crop “time ids” (SDXL’s micro-conditioning).
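All of these figures can be read off the loaded UNet; a short check, again assuming `pipe` from the Quickstart:

```python
cfg = pipe.unet.config
print(cfg.sample_size)                   # 128 -> ~1024 px output
print(cfg.cross_attention_dim)           # 2048 = 768 + 1280
print(cfg.block_out_channels)            # [320, 640, 1280]
print(cfg.transformer_layers_per_block)  # [1, 2, 10]
print(cfg.addition_embed_type)           # "text_time"

# Parameter count of the UNet alone (~2.57B).
print(sum(p.numel() for p in pipe.unet.parameters()) / 1e6, "M params")
```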
4) VAE details (encoding/decoding)
- latent_channels=4: standard SD latent format.
- scaling_factor=0.13025: how latents are scaled when passed between UNet and VAE (Diffusers uses this internally).
- force_upcast=True: during decode/encode it may upcast to float32 for numerical stability (helps avoid artifacts, but costs memory).
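A minimal round trip through the VAE makes the latent format concrete (illustrative only; a blank test image stands in for a real photo, and `pipe` comes from the Quickstart):

```python
import torch

with torch.no_grad():
    # A blank 1024x1024 RGB image in the VAE's expected [-1, 1] range.
    img = torch.zeros(1, 3, 1024, 1024, device=pipe.device, dtype=pipe.vae.dtype)

    # Encode: 8x downsampling, 4 latent channels -> [1, 4, 128, 128];
    # latents are multiplied by scaling_factor before being handed to the UNet.
    latents = pipe.vae.encode(img).latent_dist.sample() * pipe.vae.config.scaling_factor
    print(latents.shape)

    # Decode: undo the scaling first, then reconstruct pixels -> [1, 3, 1024, 1024].
    recon = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    print(recon.shape)
```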
5) Total size and what it implies for VRAM
Rough total params:
- Text encoders: 123M + 695M ≈ 818M
- UNet: 2567M
- VAE: ≈ 84M
Total: ≈ 3.47B parameters.
Implication: this is a very heavy pipeline. In fp16/bf16 the raw weights alone are roughly 7 GB (3.47B parameters × 2 bytes), and runtime activations add a lot more. You typically need:
- aggressive memory tricks (attention slicing, xFormers, CPU offload), or
- a large VRAM GPU for comfortable 1024×1024 generation.
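A sketch of the usual Diffusers memory-saving switches; which ones to enable depends on your GPU (`enable_model_cpu_offload` needs `accelerate`, and xFormers is an optional extra):

```python
# Compute attention in slices to lower peak memory.
pipe.enable_attention_slicing()

# Decode/encode latents in slices and tiles to cut VAE memory at 1024x1024.
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

# Keep submodules on CPU and move them to the GPU only while in use
# (call this instead of pipe.to("cuda"); requires accelerate).
pipe.enable_model_cpu_offload()

# Memory-efficient attention via xFormers, if installed.
pipe.enable_xformers_memory_efficient_attention()
```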
6) The torch_dtype warning
Newer Diffusers releases are moving from torch_dtype=... to dtype=... when loading; if loading prints a deprecation warning, switch the argument name (older versions still accept torch_dtype).
Model tree for telcom/deewaiREALCN
- Base model: stabilityai/stable-diffusion-xl-base-1.0