fp8_e5m2?

#98
by TheNecr0mancer - opened

Hey, could you please also make an fp8_e5m2 quant of your models when you release them?
If you make this version straight from source, skipping bf16, the speed can be up to 50% faster for a lot of users (I am one of them).

Thanks
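
To be concrete, here is a rough sketch of what I mean by "straight from source": a plain per-tensor cast of the source safetensors to fp8_e5m2 with no bf16 round-trip. The file names are placeholders, and real releases usually keep some layers (norms, biases) in higher precision, so treat this as illustrative only:

```python
# Rough sketch: cast a source checkpoint directly to fp8_e5m2, skipping bf16.
# Paths are placeholders; this is a naive per-tensor cast with no scaling.
import torch
from safetensors.torch import load_file, save_file

src = "model_source.safetensors"
dst = "model_fp8_e5m2.safetensors"

state = load_file(src)
out = {}
for name, tensor in state.items():
    # Cast only floating-point tensors; leave everything else untouched.
    out[name] = tensor.to(torch.float8_e5m2) if tensor.is_floating_point() else tensor
save_file(out, dst)
```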

Tried q8_0? Reports say it's 60% faster.

Let me know when v9 Q8 is uploaded.

[image]
That's the whole GGUF stuff, I guess; about 3 hrs.

Just tested your 5.3 GGUF, and it's exactly the same speed. I still have a request for an e5m2 made straight from source, avoiding any intermediate quant, especially bf16, as that causes a large part of the slowdown.

Click on the v9 safetensors here on HF. There is a workflow; paste it into ChatGPT and say you want a JSON.
Load it in ComfyUI, change the base and fp8 models, then run and test.

You need around 100 GB of RAM, or a whole night if your swap is big enough.

That's why I didn't try it. The model alone is 40 GB in bf16 and 80 GB in fp32.

And see, the base is just bf16, so fp32 isn't available at all, which makes the whole process pointless.
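
The sizes line up with a model of roughly 20B parameters; quick back-of-the-envelope math (parameter count assumed for illustration):

```python
# Back-of-the-envelope checkpoint sizes, assuming roughly 20B parameters.
params = 20e9
for dtype, bytes_per_param in [("fp32", 4), ("bf16", 2), ("fp8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp32: ~80 GB, bf16: ~40 GB, fp8: ~20 GB
```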

The only one I could find is https://huggingface.co/patientxtr/qwen-image-edit-fp8-e5m2/tree/main

Maybe test it and compare the speed to see whether that's true or nonsense.

Thanks for finding that for me.

The speed increase is only roughly a 5-9% gain on that model.
So either they did not quantize it optimally, or the vision model is the limiting factor here.
Now I'm trying to find an e5m2 of the TE, but one doesn't seem to exist.

A good example of a huge performance gain for e5m2 over e4m3 is the regular Qwen Image fp8 vs. the JIB Mix fp8 e5m2: I actually get nearly double the speed with that JIB model.
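
For context, the two fp8 formats trade precision for range: e4m3 has a 4-bit exponent and 3-bit mantissa, while e5m2 has a 5-bit exponent and 2-bit mantissa (the same exponent width as fp16, which is presumably why it converts so cheaply on some setups). A quick way to see the difference:

```python
import torch

# e4m3: more mantissa bits (precision), narrower range; e5m2: wider range, less precision.
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```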
