Qwen Image ModelOpt FP8 SGLang Transformer
This repository contains an SGLang-ready ModelOpt FP8 transformer override for Qwen/Qwen-Image.
It only replaces the transformer weights; tokenizer, scheduler, VAE, and other non-transformer components are loaded from the original base model.
The checkpoint is intended for SGLang Diffusion with the Qwen Image FP8 support from sgl-project/sglang#23155.
Usage
sglang generate \
--backend=sglang \
--model-id=Qwen-Image \
--model-path Qwen/Qwen-Image \
--transformer-path BBuf/Qwen-Image-ModelOpt-FP8-SGLang \
--prompt "A futuristic cyberpunk city at night, neon lights reflecting on wet streets" \
--width=1024 \
--height=1024 \
--num-inference-steps=50 \
--guidance-scale=4.0 \
--seed=42 \
--num-gpus=1 \
--dit-cpu-offload false \
--dit-layerwise-offload false \
--warmup \
--save-output
H100 Validation Snapshot
Validation was run on one H100 GPU using rank0 with --backend=sglang. The FP8 image below is from the fixed checkpoint after keeping the validated sensitive Qwen Image fallback tensors in BF16.
Artifacts:
- Validation tree: validation/
- BF16 command: validation/commands/bf16_qwen_image_1024_50_benchmark.sh
- FP8 command: validation/commands/fp8_fixed_qwen_image_1024_50_benchmark.sh
- Benchmark comparison: qwen_image_bf16_vs_fp8_fixed_1024_50_compare.md
- Profiler traces: BF16, FP8 fixed
Benchmark, warmup excluded:
| Metric | BF16 | FP8 fixed | Delta | Speedup |
|---|---|---|---|---|
| E2E latency | 13.589 s | 12.159 s | -1.430 s (-10.5%) | 1.12x |
| Denoising stage | 12.929 s | 11.437 s | -1.491 s (-11.5%) | 1.13x |
| Decoding stage | 58.55 ms | 52.30 ms | -6.25 ms (-10.7%) | 1.12x |
| Text encoding | 599.85 ms | 666.43 ms | +66.57 ms (+11.1%) | 0.90x |
Notes:
- Validation prompt: A futuristic cyberpunk city at night, neon lights reflecting on wet streets.
- Validation settings: 1024x1024, 50 inference steps, guidance_scale=4.0, seed=42, --dit-cpu-offload false, --dit-layerwise-offload false, --warmup.
- Profiler artifacts were captured separately with profiler flags; those profiler timings include profiling overhead and are not used as benchmark latency numbers.
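The Delta and Speedup columns in the benchmark table can be recomputed from the two latency columns. The following is a minimal sketch, not part of the validation scripts: the helper name `compare` is hypothetical, and because it works from the rounded values printed in the table, the last digit of a delta may differ slightly from the published figure.

```python
def compare(baseline: float, candidate: float) -> tuple[float, float, float]:
    """Return (delta, percent change, speedup) for two latencies in the same unit.

    delta   = candidate - baseline   (negative means the candidate is faster)
    percent = 100 * delta / baseline
    speedup = baseline / candidate   (above 1.0 means the candidate is faster)
    """
    delta = candidate - baseline
    percent = 100.0 * delta / baseline
    speedup = baseline / candidate
    return delta, percent, speedup


# (BF16, FP8 fixed) values copied from the benchmark table above.
rows = {
    "E2E latency (s)": (13.589, 12.159),
    "Denoising stage (s)": (12.929, 11.437),
    "Decoding stage (ms)": (58.55, 52.30),
    "Text encoding (ms)": (599.85, 666.43),
}

for name, (bf16, fp8) in rows.items():
    delta, percent, speedup = compare(bf16, fp8)
    print(f"{name}: {delta:+.3f} ({percent:+.1f}%), {speedup:.2f}x")
```

A speedup below 1.0, as in the text encoding row, means the FP8 run was slower than BF16 for that stage.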
Conversion Notes
The checkpoint was converted from an NVIDIA ModelOpt FP8 export with SGLang's build_modelopt_fp8_transformer tool.
Most linear weights are FP8. The validated fallback set keeps numerically sensitive tensors in BF16, including the Qwen Image image-MLP output projection family needed for normal image quality.
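One way to see the FP8/BF16 split for yourself is to tally tensor dtypes in a weight file. The sketch below assumes the weights are stored as .safetensors files, which this card does not state explicitly, and the function name `count_dtypes` is hypothetical. It reads only the file header (an 8-byte little-endian length followed by a JSON table of tensors), so no weights are loaded into memory.

```python
import json
import struct
from collections import Counter


def count_dtypes(path: str) -> Counter:
    """Tally tensor dtypes listed in a .safetensors header.

    The file starts with an unsigned 64-bit little-endian integer giving the
    header size in bytes, followed by that many bytes of JSON. Each JSON entry
    maps a tensor name to its dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

Calling `count_dtypes` on a locally downloaded shard (the file name depends on the repository layout and is not given here) returns counts keyed by safetensors dtype strings. If the assumption holds, FP8 weights would likely appear under a key such as `F8_E4M3` and the fallback tensors under `BF16`; scale tensors, if present, may show up under other dtypes.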
