Update: Should now be Fixed - Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?

#5
by ubergarm - opened

Feb 27 update: Now fixed, see: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Some folks on reddit were discussing various quants of this model, and your UD-Q4_K_XL seemed to be on the small side and underperforming on perplexity compared to what folks expected.

Sure enough, looking inside the tensors, there are MXFP4 tensors mysteriously used for attn_gate and even the experts?

`blk.0.attn_gate.weight  [2048, 4096]  MXFP4`

https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_file_info=Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

(screenshot: unsloth-qwen35moe-recipe-bug)


Hey @ubergarm ! Great work as usual! I'm currently investigating UD-Q4_K_XL - I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere - I will update the community ASAP.

Thank you again for the investigation and also the rest of the community's patience - we highly appreciate it.

For now, MXFP4_MOE, which is also partially dynamic, is the correct option, or Q4_K_M, which also uses our imatrix calibration dataset.

I will hopefully update everyone soon - and thank you to the community for the constant support.

It happens! Oh man, I have so many little copy-paste issues between the scripts, and with something like five Qwen models coming out in a week, all with similar but slightly different names, it's been a lot haha...

Thanks Daniel and you and your bro and team get some rest! Cheers!

Unsloth AI org

Thank you immensely!

Q2 and Q3 also have tensors set to MXFP4.

Furthermore, at least Qwen3 Coder Next's UD GGUFs have the same issue, if not more of them.


Qwen 3 Coder Next has other weights forced to MXFP4 in even the non-UD quants as well.

Can someone help me understand this issue? I thought MXFP4 was not a "bad" quant type but it seems to cause trouble here?

@CHNtentes

There are no bad quant types, just incorrect choices for specific tensors. Generally MXFP4 is used specifically for the routed experts ffn_(gate|down|up)_exps and not for attn.* tensors. Also, a mix labeled UD-Q4_K_XL implies there would be no MXFP4 tensors at all, just a mix of q4_K and similar types.

Accidentally using MXFP4 in the wrong place could lead to lower quality than could otherwise be achieved.

You can see some perplexity and KLD data in the reddit thread linked above which shows a measurable discrepancy. This is what alerted folks to the potential issue.
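One way to catch this kind of recipe slip is to scan the tensor list and flag any MXFP4 tensor that sits outside the routed experts. A minimal sketch (the helper name and the (name, type) pair format are my own, not anything from Unsloth's scripts):

```python
import re

# Routed-expert tensors are the only place MXFP4 is normally expected
# in a mixed recipe like UD-Q4_K_XL.
EXPERT_RE = re.compile(r"ffn_(gate|down|up)_exps")

def suspect_mxfp4(tensors):
    """tensors: iterable of (name, quant_type_name) pairs.
    Returns the names quantized as MXFP4 outside the routed experts."""
    return [name for name, qtype in tensors
            if qtype == "MXFP4" and not EXPERT_RE.search(name)]

print(suspect_mxfp4([
    ("blk.0.attn_gate.weight", "MXFP4"),      # flagged: attn tensor
    ("blk.0.ffn_gate_exps.weight", "MXFP4"),  # fine: routed expert
    ("blk.0.attn_q.weight", "Q4_K"),          # fine: not MXFP4
]))
```

If I'm reading the `gguf` Python package correctly, you could feed this real data with something like `[(t.name, t.tensor_type.name) for t in GGUFReader(path).tensors]`, though the exact field names may differ across versions.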

@Fizzarolli

Thanks for checking further, hopefully the bug didn't propagate too far or long, but yes that looks unfortunate.

Thank you @Fizzarolli for the help! I'm checking why MXFP4 got injected into the XL quant specifically - it definitely seems like my switch to MXFP4 had some issues. So sorry about that - will update everyone soon.

I honestly don't understand much about this, but I also intuitively feel that the presence of MXFP4 quants in the layers can degrade the generation results. If I remember correctly, this coarser quantization isn't suitable for everything.

@danielhanchen

I remember your post says the XL quants offer the best combination of quality and speed. How are you measuring that mix of speed and quality? I see q6_K and q5_K being slow. My hunch is that a mix using mxfp4 and q8 might be fastest with the best quality, even though mxfp4 itself might not be the most accurate. Is there a possibility of an iq4_nl and q8 mix?

Here is an analysis of speed on my machine with the Vulkan backend. Depending on your backend and GPU, it might be different for you.

My Conclusion for speed based on quant-bench:

| Overall Rank | Quant | PP TOPS (Rank) | TG TOPS (Rank) | Average Rank |
|---|---|---|---|---|
| 1 | iq2_xs | 1.52 (1) | 0.22 (1) | 1.0 |
| 2 | iq2_xxs | 1.52 (1) | 0.18 (2) | 1.5 |
| 3 | iq4_nl | 1.50 (3) | 0.13 (6) | 4.5 |
| 4 | iq2_s | 1.43 (7) | 0.18 (2) | 4.5 |
| 5 | mxfp4 | 1.47 (5) | 0.13 (6) | 5.5 |
| 6 | iq3_xxs | 1.43 (7) | 0.17 (4) | 5.5 |
| 7 | q5_0 | 1.44 (6) | 0.12 (8) | 7.0 |
| 8 | q8_0 | 1.50 (3) | 0.08 (13) | 8.0 |
| 9 | iq3_s | 1.42 (9) | 0.12 (8) | 8.5 |
| 10 | q4_0 | 1.31 (13) | 0.15 (5) | 9.0 |
| 11 | q4_K | 1.40 (10) | 0.11 (10) | 10.0 |
| 12 | q5_K | 1.37 (11) | 0.11 (10) | 10.5 |
| 13 | iq4_xs | 1.32 (12) | 0.11 (10) | 11.0 |
| 14 | q6_K | 1.27 (14) | 0.08 (13) | 13.5 |

My quant-bench Data:

```
$ ./quant-bench -d vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
Using device: Vulkan0

=== Prompt Processing (Prefill) Phase (Batch Size = 512) ===

Quant     Time (us)    TOPS
q4_0      45852.47     1.31
q4_K      43057.02     1.40
q5_0      41751.17     1.44
q5_K      44029.87     1.37
q6_K      47377.60     1.27
q8_0      40007.54     1.50
iq2_xxs   39476.48     1.52
iq2_xs    39648.50     1.52
iq2_s     42163.56     1.43
iq3_xxs   42061.97     1.43
iq3_s     42290.59     1.42
iq4_nl    40171.84     1.50
iq4_xs    45535.99     1.32
mxfp4     40768.00     1.47

=== Token Generation (Decoding) Phase (Batch Size = 1) ===

Quant     Time (us)    TOPS    Eff. BW (GB/s)
q4_0        796.95     0.15    41.54
q4_K       1029.23     0.11    32.16
q5_0        961.21     0.12    42.08
q5_K       1103.46     0.11    36.65
q6_K       1433.84     0.08    33.65
q8_0       1513.28     0.08    41.28
iq2_xxs     641.39     0.18    23.72
iq2_xs      522.21     0.22    32.64
iq2_s       637.26     0.18    29.63
iq3_xxs     683.46     0.17    33.00
iq3_s       959.73     0.12    26.37
iq4_nl      913.31     0.13    36.25
iq4_xs     1053.45     0.11    29.68
mxfp4       875.12     0.13    35.73
```
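For anyone who wants to sanity-check the summary table, the average ranks can be recomputed from the raw TOPS figures; this sketch assumes competition ranking (ties share the best rank), which appears to be how the table was derived:

```python
# Recompute the "Average Rank" summary from the raw quant-bench TOPS
# numbers above (Vulkan, RADV RENOIR).
pp = {"q4_0": 1.31, "q4_K": 1.40, "q5_0": 1.44, "q5_K": 1.37,
      "q6_K": 1.27, "q8_0": 1.50, "iq2_xxs": 1.52, "iq2_xs": 1.52,
      "iq2_s": 1.43, "iq3_xxs": 1.43, "iq3_s": 1.42, "iq4_nl": 1.50,
      "iq4_xs": 1.32, "mxfp4": 1.47}
tg = {"q4_0": 0.15, "q4_K": 0.11, "q5_0": 0.12, "q5_K": 0.11,
      "q6_K": 0.08, "q8_0": 0.08, "iq2_xxs": 0.18, "iq2_xs": 0.22,
      "iq2_s": 0.18, "iq3_xxs": 0.17, "iq3_s": 0.12, "iq4_nl": 0.13,
      "iq4_xs": 0.11, "mxfp4": 0.13}

def rank(scores, q):
    # competition ranking: 1 + number of quants with strictly higher TOPS
    return 1 + sum(v > scores[q] for v in scores.values())

avg_rank = {q: (rank(pp, q) + rank(tg, q)) / 2 for q in pp}
for q in sorted(avg_rank, key=avg_rank.get):
    print(f"{q:8s} {avg_rank[q]:4.1f}")
```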

@engrtipusultan
I just want to note that not all quants maintain good speed when running in hybrid or even pure CPU mode.

> I just want to note that not all quants maintain good speed when running in hybrid or even pure CPU mode.

Sorry, I did not understand that. Can you explain more?

I have an AMD APU, so I am personally using Vulkan only, with no offloading to CPU. I think the same might be the case for Apple devices, where there is unified memory.

I do understand that quant speed differs across hardware and backends. For example, on hardware where BF16 or MXFP4 is natively supported they are much faster; similarly, IQ quants are slower on certain backends, and so on.

Similarly, hybrid CPU-plus-GPU inference might give different results.

@engrtipusultan
There used to be a table in the llama.cpp repository, but it seems to be outdated and removed; I couldn't find it. I only remember that not all quants had good performance on all backends. For example, IQ quants were GPU-only and slower on CPU and hybrid. I don't know how it is now.

> There used to be a table in the llama.cpp repository, but it seems to be outdated and removed. I couldn't find it. I only remember that not all quants had good performance on all backends. For example, IQ quants were only for GPU; they were slower on CPU and hybrid. I don't know how it is now.

It is still there, just hard to find :P. Based on testing it seems outdated, at least for the Vulkan backend on my hardware: iq4_nl shows as the fastest 4-bit quant, whereas the table says IQ quants are supposed to be slow on the Vulkan backend.

Qwen3.5-122B-A10B also has the same issue. It seems MXFP4 is widely used in the UD quants, and it has caused the 122B Q3-level quants to perform much worse than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in a 100B-class model for a long time.

Unsloth AI org

Will post some results soon - again, apologies for the delay and sorry for the issues - I'll update everyone soon on MXFP4 vs Q4 specifically.

But overall:

  1. All our quants work fine; the ones with MXFP4 layers are just slightly less performant than the Q4 variants.
  2. UD-Q4_K_XL is the main issue - I would use MXFP4 instead, for example, which still uses some of our dynamic methodology and our calibration dataset.
  3. There is a tool-calling bug which I will also fix - this is not part of our quants, but related to a generic issue with the model.

MXFP4 layers make llama.cpp crash with the Vulkan backend on Intel Arc iGPUs and discrete GPUs.

@danielhanchen Do you plan on updating ggufs for 397B, 112B and 35B in the next few hours? I've been waiting for the dust to settle before downloading several GB's worth of data :-) Thanks for your work

> There are no bad quant types, just incorrect choices for specific tensors. Generally MXFP4 is used specifically for the routed experts ffn_(gate|down|up)_exps and not for attn.* tensors. Also, a mix labeled UD-Q4_K_XL implies there would be no MXFP4 tensors, just a mix of q4_K and similar types.
>
> Accidentally using MXFP4 in the wrong place could lead to lower quality than could otherwise be achieved.
>
> You can see some perplexity and KLD data in the reddit thread linked above which shows a measurable discrepancy. This is what alerted folks to the potential issue.

@ubergarm Why do you say that MXFP4 is not a bad quant type? I think MXFP4 being used for experts is simply because the GPT-OSS models are natively trained that way, not because the distribution of the weight values in experts specifically "conform" to MXFP4. There is a comment by ikawrakow about using MXFP4 for PTQ, and while I lack any experience with AI/ML, I agree with him that only having a single exponent bit is really suboptimal for accuracy.

Honestly, I believe the purpose of FP4 and other low-precision floats assigning most of the bits to the exponent field is just to cover the widest dynamic range possible while minimizing the need for extra scaling factors (which will complicate direct hardware implementations).
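For context on why the single mantissa bit bites: as I understand the OCP Microscaling spec, MXFP4 stores 32 E2M1 elements (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) plus one shared power-of-two scale per block. A toy round-trip sketch (illustrative only, not the llama.cpp kernel or the OCP reference rounding):

```python
import math

# Representable E2M1 magnitudes: only two mantissa levels per binade.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(xs):
    """Quantize a 32-element block: one shared power-of-two exponent,
    each element rounded to the nearest signed E2M1 value."""
    assert len(xs) == 32
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 0, [0.0] * 32
    # Shared scale: smallest power of two mapping the block max onto <= 6.0
    exp = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exp
    vals = []
    for x in xs:
        q = min(E2M1, key=lambda v: abs(v - abs(x) / scale))  # nearest code
        vals.append(math.copysign(q, x))
    return exp, vals

def dequantize_block(exp, vals):
    return [v * 2.0 ** exp for v in vals]
```

Values that happen to land on a code (e.g. 1.5, 3.0, 6.0 at scale 1) survive exactly; everything else gets snapped to one of only 15 signed levels, which is the coarseness being discussed above.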

shimmyshimmer changed discussion title from Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts? to Fixed: Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?
shimmyshimmer changed discussion title from Fixed: Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts? to Update: Should now be Fixed - Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?

TiwitMuffbiscuit here. Thanks a lot! I know it’s a lot of work, but it’s also a lot of good data to work with. You even went as far as testing against all other major quants and left everything available for us to see—that’s a first. Beautiful graphs.

I’ve downloaded everything I needed to put into the Reddit post; I’m just waiting for the god-damn computer to finish a task. Seven hours to go, then I’ll be able to do some tests. In the meantime, I’ll update the post adequately.

I have a Vega 8 with the Vulkan backend (poor GPU club). In past testing, Q8 always had better PP than Q4, and Q4 always had better TG than Q8.

The tests shared above show that all quants have almost the same TG, and PP is even marginally higher at Q4 than at Q8, which is the reverse of what I measured before.

Can other community members confirm what the case is for them?

> Qwen3.5-122B-A10B also has the same issue, it seems MXFP4 is widely used in UD quantization, and it has caused the performance of 122B Q3-level quantization to be much lower than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in 100B-level models for a long time.

I have been seeing those same issues with Q8 as well.

> Qwen3.5-122B-A10B also has the same issue, it seems MXFP4 is widely used in UD quantization, and it has caused the performance of 122B Q3-level quantization to be much lower than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in 100B-level models for a long time.

Qwen3-Coder-Next as well. Hope to see new UD GGUFs...
