DiffusionVL-Qwen2.5

A DiffusionVL model that combines a SigLIP vision encoder, a PoolerProjector multimodal projector, and a Qwen2.5 LLM that generates text with BD3LM block-diffusion decoding.
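To confirm which components a given checkpoint ships with, you can dump its configuration. This is a minimal sketch; it only assumes the same placeholder model path used in the snippets below.

from transformers import AutoConfig

# trust_remote_code is required because the DiffusionVL classes ship with the checkpoint
config = AutoConfig.from_pretrained("path/to/model", trust_remote_code=True)
print(config)  # prints all fields, including the vision, projector, and BD3LM settings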

Usage

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load processor
processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)

# Prepare inputs
image = Image.open("image.jpg").convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."}
    ]}
]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, 'to') else v for k, v in inputs.items()}

# Generate with BD3LM block-diffusion decoding
output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),
    gen_length=256,                              # number of tokens to generate
    steps=8,                                     # diffusion denoising steps
    temperature=0.0,                             # 0.0 = deterministic decoding
    remasking_strategy="low_confidence_static",  # how tokens are re-masked between steps
)

# Decode
output_text = processor.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
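For repeated calls, the steps above can be bundled into a small helper. The sketch below simply reuses the calls from the snippet; the describe_image name and its default arguments are ours for illustration, not part of the released API.

def describe_image(model, processor, image_path, prompt, gen_length=256, steps=8):
    """Hypothetical convenience wrapper around the usage snippet above."""
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
    output_ids = model.generate(
        inputs=inputs["input_ids"],
        images=inputs.get("pixel_values"),
        gen_length=gen_length,
        steps=steps,
        temperature=0.0,
        remasking_strategy="low_confidence_static",
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(describe_image(model, processor, "image.jpg", "Describe this image."))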

Model Configuration

  • Architecture: DiffusionVL_Qwen2_5_ForConditionalGeneration
  • Vision Encoder: SigLIP (384x384, patch_size=14)
  • MM Projector: PoolerProjector (Conv2d + MLP)
  • LLM: Qwen2.5 (standard RoPE)
  • BD3LM Enabled: True
  • Block Size: 8
  • Hidden Size: 3584
  • Num Layers: 28
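For intuition on the projector, the sketch below shows one plausible reading of "Conv2d + MLP": a strided convolution that pools the SigLIP patch grid, followed by an MLP into the LLM hidden size. The class name, layer sizes, and pooling stride are assumptions for illustration, not the checkpoint's actual modules or weights.

import torch
import torch.nn as nn

class PoolerProjectorSketch(nn.Module):
    """Hypothetical projector: pool the patch grid with a strided Conv2d, then project with an MLP."""
    def __init__(self, vision_hidden=1152, llm_hidden=3584, stride=2):
        super().__init__()
        # A kernel_size == stride convolution downsamples the patch grid by `stride` per side
        self.conv = nn.Conv2d(vision_hidden, llm_hidden, kernel_size=stride, stride=stride)
        self.mlp = nn.Sequential(nn.GELU(), nn.Linear(llm_hidden, llm_hidden))

    def forward(self, patch_features):  # (batch, num_patches, vision_hidden)
        b, n, c = patch_features.shape
        side = int(n ** 0.5)  # assume a square patch grid
        x = patch_features.transpose(1, 2).reshape(b, c, side, side)
        x = self.conv(x)                   # (b, llm_hidden, side // stride, side // stride)
        x = x.flatten(2).transpose(1, 2)   # (b, pooled_tokens, llm_hidden)
        return self.mlp(x)

feats = torch.randn(1, 729, 1152)             # e.g. a 27x27 SigLIP patch grid
print(PoolerProjectorSketch()(feats).shape)   # torch.Size([1, 169, 3584])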