Qwen Image ModelOpt FP8 SGLang Transformer

This repository contains an SGLang-ready ModelOpt FP8 transformer override for Qwen/Qwen-Image. It replaces only the transformer weights; the tokenizer, scheduler, VAE, and other non-transformer components are loaded from the original base model.

The checkpoint is intended for SGLang Diffusion with the Qwen Image FP8 support from sgl-project/sglang#23155.

Usage

sglang generate \
  --backend=sglang \
  --model-id=Qwen-Image \
  --model-path Qwen/Qwen-Image \
  --transformer-path BBuf/Qwen-Image-ModelOpt-FP8-SGLang \
  --prompt "A futuristic cyberpunk city at night, neon lights reflecting on wet streets" \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --seed=42 \
  --num-gpus=1 \
  --dit-cpu-offload false \
  --dit-layerwise-offload false \
  --warmup \
  --save-output
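
For scripted runs (for example, sweeping prompts or seeds), the same invocation can be assembled as an argument list in Python. This is a minimal sketch: the helper name build_generate_command and its parameters are hypothetical, and every flag is copied from the command above rather than taken from SGLang's documentation.

```python
import shlex


def build_generate_command(prompt, seed=42, steps=50, width=1024, height=1024):
    """Return the argv list for the `sglang generate` call shown above.

    Hypothetical helper: only the flags from the command above are used.
    """
    return [
        "sglang", "generate",
        "--backend=sglang",
        "--model-id=Qwen-Image",
        "--model-path", "Qwen/Qwen-Image",
        # FP8 transformer override; all other components come from the base model.
        "--transformer-path", "BBuf/Qwen-Image-ModelOpt-FP8-SGLang",
        "--prompt", prompt,
        f"--width={width}",
        f"--height={height}",
        f"--num-inference-steps={steps}",
        "--guidance-scale=4.0",
        f"--seed={seed}",
        "--num-gpus=1",
        "--dit-cpu-offload", "false",
        "--dit-layerwise-offload", "false",
        "--warmup",
        "--save-output",
    ]


cmd = build_generate_command(
    "A futuristic cyberpunk city at night, neon lights reflecting on wet streets"
)
# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(cmd))
```

To actually launch the run, the list can be passed to subprocess.run(cmd, check=True) on a machine where sglang is installed.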

H100 Validation Snapshot

Validation was run on one H100 GPU using rank0 with --backend=sglang. The FP8 image below is from the fixed checkpoint after keeping the validated sensitive Qwen Image fallback tensors in BF16.

Artifacts:

Validation tree: validation/
BF16 command: validation/commands/bf16_qwen_image_1024_50_benchmark.sh
FP8 command: validation/commands/fp8_fixed_qwen_image_1024_50_benchmark.sh
Benchmark comparison: qwen_image_bf16_vs_fp8_fixed_1024_50_compare.md
Profiler traces: BF16, FP8 fixed

Image comparison: BF16, 1024x1024, 50 steps (left) and FP8 fixed, 1024x1024, 50 steps (right).

Benchmark, warmup excluded:

Metric	BF16	FP8 fixed	Delta	Speedup
E2E latency	13.589 s	12.159 s	-1.430 s (-10.5%)	1.12x
Denoising stage	12.929 s	11.437 s	-1.491 s (-11.5%)	1.13x
Decoding stage	58.55 ms	52.30 ms	-6.25 ms (-10.7%)	1.12x
Text encoding	599.85 ms	666.43 ms	+66.57 ms (+11.1%)	0.90x
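
The Delta and Speedup columns follow from the two latency columns: delta is FP8 minus BF16, the percentage is relative to BF16, and speedup is BF16 divided by FP8. The short sketch below recomputes them from the rounded values in the table; the dictionary and function names are illustrative. Because it starts from rounded inputs, the last digit of some deltas can differ from the table (for example, -1.492 s instead of -1.491 s for the denoising stage).

```python
# Latencies copied from the table above: (BF16, FP8 fixed).
rows = {
    "E2E latency (s)": (13.589, 12.159),
    "Denoising stage (s)": (12.929, 11.437),
    "Decoding stage (ms)": (58.55, 52.30),
    "Text encoding (ms)": (599.85, 666.43),
}


def compare(bf16, fp8):
    """Return (absolute delta, percent delta relative to BF16, speedup)."""
    delta = fp8 - bf16
    percent = 100.0 * delta / bf16
    speedup = bf16 / fp8
    return delta, percent, speedup


for name, (bf16, fp8) in rows.items():
    delta, percent, speedup = compare(bf16, fp8)
    print(f"{name}: delta={delta:+.3f} ({percent:+.1f}%), speedup={speedup:.2f}x")
```

A speedup below 1.00x, as for text encoding, means the FP8 run was slower on that stage.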

Notes:

Validation prompt: A futuristic cyberpunk city at night, neon lights reflecting on wet streets.
Validation settings: 1024x1024, 50 inference steps, guidance_scale=4.0, seed=42, --dit-cpu-offload false, --dit-layerwise-offload false, --warmup.
Profiler artifacts were captured separately with profiler flags; those profiler timings include profiling overhead and are not used as benchmark latency numbers.

Conversion Notes

The checkpoint was converted from an NVIDIA ModelOpt FP8 export with SGLang's build_modelopt_fp8_transformer tool. Most linear weights are FP8. The validated fallback set keeps numerically sensitive tensors in BF16, including the Qwen Image image-MLP output projection family needed for normal image quality.
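
One way to see which tensors ended up in FP8 and which stayed in BF16 is to tally dtypes from a weight file's header. The sketch below assumes the transformer weights are stored in the safetensors format, which this card does not state; it reads only the JSON header (an 8-byte little-endian length followed by the header itself) using the standard library. The function names and the example file name are hypothetical, and the exact dtype labels in a given file (FP8 is typically written as F8_E4M3) should be checked rather than assumed.

```python
import json
import struct
from collections import Counter


def read_header(path):
    """Read the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def tally_dtypes(path):
    """Count tensors per dtype string, e.g. Counter({'F8_E4M3': ..., 'BF16': ...})."""
    return Counter(entry["dtype"] for entry in read_header(path).values())


def names_with_dtype(path, dtype):
    """List tensor names stored with the given dtype string, sorted by name."""
    return sorted(
        name for name, entry in read_header(path).items() if entry["dtype"] == dtype
    )


# Example usage (hypothetical file name):
#   print(tally_dtypes("transformer.safetensors"))
#   print(names_with_dtype("transformer.safetensors", "BF16"))
```

Listing the BF16 names is a quick check that the fallback tensors were not quantized.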
