Update: Should now be Fixed - Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?

#5
by ubergarm - opened

Feb 27 update: Now fixed, see: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Some folks on reddit were discussing various quants of this model, and your UD-Q4_K_XL seemed to be on the small side and underperforming on perplexity compared to what folks expected.

Sure enough, looking inside the tensors, there are MXFP4 tensors mysteriously used for attn_gate and even the experts?

`blk.0.attn_gate.weight  [2048, 4096]  MXFP4`

https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_file_info=Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

(screenshot: unsloth-qwen35moe-recipe-bug)


Hey @ubergarm ! Great work as usual! I'm currently investigating UD-Q4_K_XL - I recently switched to using MXFP4, but as you noted my script most likely had some issues somewhere - I will update the community ASAP.

Thank you again for the investigation and also the rest of the community's patience - we highly appreciate it.

For now, MXFP4_MOE, which is also partially dynamic, is the correct option, or Q4_K_M, which also uses our imatrix calibration dataset.

I will hopefully update everyone soon - and thank you to the community for the constant support.

It happens! Oh man, I have so many little copy-paste issues between the scripts, and with something like five Qwen models coming out in a week, all with similar but slightly different names, it's been a lot haha...

Thanks Daniel and you and your bro and team get some rest! Cheers!

Unsloth AI org

Thank you immensely!

Q2 and Q3 also have tensors set to MXFP4.

Furthermore, at least Qwen3 Coder Next's UD GGUFs have the same issue, if not more of them.


Qwen 3 Coder Next has other weights forced to MXFP4 in even the non-UD quants as well.

Can someone help me understand this issue? I thought MXFP4 was not a "bad" quant type but it seems to cause trouble here?

@CHNtentes

There are no bad quant types, just incorrect choices for specific tensors. Generally MXFP4 is used specifically for the routed experts ffn_(gate|down|up)_exps and not for attn.* tensors. Also, a mix labeled UD-Q4_K_XL implies there would be no MXFP4 tensors at all, just a mix of q4_K and similar types.

Accidentally using MXFP4 in the wrong place could lead to lower quality than could otherwise be achieved.

You can see some perplexity and KLD data in the reddit thread linked above which shows a measurable discrepancy. This is what alerted folks to the potential issue.
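One way to catch this kind of recipe slip is to scan the tensor list and flag any MXFP4 tensor that sits outside the routed experts. A minimal sketch (the helper name and the (name, type) pair format are my own, not anything from Unsloth's scripts):

```python
import re

# Routed-expert tensors are the only place MXFP4 is normally expected
# in a mixed recipe like UD-Q4_K_XL.
EXPERT_RE = re.compile(r"ffn_(gate|down|up)_exps")

def suspect_mxfp4(tensors):
    """tensors: iterable of (name, quant_type_name) pairs.
    Returns the names quantized as MXFP4 outside the routed experts."""
    return [name for name, qtype in tensors
            if qtype == "MXFP4" and not EXPERT_RE.search(name)]

print(suspect_mxfp4([
    ("blk.0.attn_gate.weight", "MXFP4"),      # flagged: attn tensor
    ("blk.0.ffn_gate_exps.weight", "MXFP4"),  # fine: routed expert
    ("blk.0.attn_q.weight", "Q4_K"),          # fine: not MXFP4
]))
```

If I'm reading the `gguf` Python package correctly, you could feed this real data with something like `[(t.name, t.tensor_type.name) for t in GGUFReader(path).tensors]`, though the exact field names may differ across versions.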

@Fizzarolli

Thanks for checking further, hopefully the bug didn't propagate too far or long, but yes that looks unfortunate.

Thank you @Fizzarolli for the help! I'm checking why MXFP4 got injected into the XL quant specifically - it definitely seems like my switch to MXFP4 had some issues. So sorry about that - will update everyone soon.

I honestly don't understand much about this, but I also intuitively feel that the presence of MXFP4 quants in the layers can degrade the generation results. If I remember correctly, this coarser quantization isn't suitable for everything.

@danielhanchen

I remember your post says the XL quants offer the best combination of quality and speed. How are you measuring that mix of speed and quality? I see q6_K and q5_K being slow. My hunch is that a mix using mxfp4 and q8 might be fastest with the best quality, even though mxfp4 itself might not be the most accurate. Is there a possibility of an iq4_nl and q8 mix?

Here is an analysis of speed on my machine with the Vulkan backend. Depending on your backend and GPU, it might be different for you.

My Conclusion for speed based on quant-bench:

| Overall Rank | Quant | PP TOPS (Rank) | TG TOPS (Rank) | Average Rank |
|---|---|---|---|---|
| 1 | iq2_xs | 1.52 (1) | 0.22 (1) | 1.0 |
| 2 | iq2_xxs | 1.52 (1) | 0.18 (2) | 1.5 |
| 3 | iq4_nl | 1.50 (3) | 0.13 (6) | 4.5 |
| 4 | iq2_s | 1.43 (7) | 0.18 (2) | 4.5 |
| 5 | mxfp4 | 1.47 (5) | 0.13 (6) | 5.5 |
| 6 | iq3_xxs | 1.43 (7) | 0.17 (4) | 5.5 |
| 7 | q5_0 | 1.44 (6) | 0.12 (8) | 7.0 |
| 8 | q8_0 | 1.50 (3) | 0.08 (13) | 8.0 |
| 9 | iq3_s | 1.42 (9) | 0.12 (8) | 8.5 |
| 10 | q4_0 | 1.31 (13) | 0.15 (5) | 9.0 |
| 11 | q4_K | 1.40 (10) | 0.11 (10) | 10.0 |
| 12 | q5_K | 1.37 (11) | 0.11 (10) | 10.5 |
| 13 | iq4_xs | 1.32 (12) | 0.11 (10) | 11.0 |
| 14 | q6_K | 1.27 (14) | 0.08 (13) | 13.5 |

My quant-bench Data:

```
$ ./quant-bench -d vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
Using device: Vulkan0

=== Prompt Processing (Prefill) Phase (Batch Size = 512) ===

Quant     Time (us)    TOPS
q4_0      45852.47     1.31
q4_K      43057.02     1.40
q5_0      41751.17     1.44
q5_K      44029.87     1.37
q6_K      47377.60     1.27
q8_0      40007.54     1.50
iq2_xxs   39476.48     1.52
iq2_xs    39648.50     1.52
iq2_s     42163.56     1.43
iq3_xxs   42061.97     1.43
iq3_s     42290.59     1.42
iq4_nl    40171.84     1.50
iq4_xs    45535.99     1.32
mxfp4     40768.00     1.47

=== Token Generation (Decoding) Phase (Batch Size = 1) ===

Quant     Time (us)    TOPS    Eff. BW (GB/s)
q4_0        796.95     0.15    41.54
q4_K       1029.23     0.11    32.16
q5_0        961.21     0.12    42.08
q5_K       1103.46     0.11    36.65
q6_K       1433.84     0.08    33.65
q8_0       1513.28     0.08    41.28
iq2_xxs     641.39     0.18    23.72
iq2_xs      522.21     0.22    32.64
iq2_s       637.26     0.18    29.63
iq3_xxs     683.46     0.17    33.00
iq3_s       959.73     0.12    26.37
iq4_nl      913.31     0.13    36.25
iq4_xs     1053.45     0.11    29.68
mxfp4       875.12     0.13    35.73
```
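For anyone who wants to sanity-check the summary table, the average ranks can be recomputed from the raw TOPS figures; this sketch assumes competition ranking (ties share the best rank), which appears to be how the table was derived:

```python
# Recompute the "Average Rank" summary from the raw quant-bench TOPS
# numbers above (Vulkan, RADV RENOIR).
pp = {"q4_0": 1.31, "q4_K": 1.40, "q5_0": 1.44, "q5_K": 1.37,
      "q6_K": 1.27, "q8_0": 1.50, "iq2_xxs": 1.52, "iq2_xs": 1.52,
      "iq2_s": 1.43, "iq3_xxs": 1.43, "iq3_s": 1.42, "iq4_nl": 1.50,
      "iq4_xs": 1.32, "mxfp4": 1.47}
tg = {"q4_0": 0.15, "q4_K": 0.11, "q5_0": 0.12, "q5_K": 0.11,
      "q6_K": 0.08, "q8_0": 0.08, "iq2_xxs": 0.18, "iq2_xs": 0.22,
      "iq2_s": 0.18, "iq3_xxs": 0.17, "iq3_s": 0.12, "iq4_nl": 0.13,
      "iq4_xs": 0.11, "mxfp4": 0.13}

def rank(scores, q):
    # competition ranking: 1 + number of quants with strictly higher TOPS
    return 1 + sum(v > scores[q] for v in scores.values())

avg_rank = {q: (rank(pp, q) + rank(tg, q)) / 2 for q in pp}
for q in sorted(avg_rank, key=avg_rank.get):
    print(f"{q:8s} {avg_rank[q]:4.1f}")
```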

@engrtipusultan
I just want to note that not all quants maintain good speed when running in hybrid or even pure CPU mode.

> I just want to note that not all quants maintain good speed when running in hybrid or even pure CPU mode.

Sorry, I did not understand that. Can you explain more?

I have an AMD APU, so I am personally using Vulkan only, with no offloading to CPU. I think the same might be the case for Apple devices, where there is unified memory.

I do understand that quant speed differs across hardware and backends. For example, on hardware where BF16 or MXFP4 is natively supported they are much faster; similarly, IQ quants are slower on certain backends, and so on.

Similarly, hybrid CPU-plus-GPU inference might give different results.

@engrtipusultan
There used to be a table in the llama.cpp repository, but it seems to be outdated and removed; I couldn't find it. I only remember that not all quants had good performance on all backends. For example, IQ quants were GPU-only and slower on CPU and hybrid. I don't know how it is now.

> There used to be a table in the llama.cpp repository, but it seems to be outdated and removed. I couldn't find it. I only remember that not all quants had good performance on all backends. For example, IQ quants were only for GPU; they were slower on CPU and hybrid. I don't know how it is now.

It is still there, just hard to find :P. Based on testing it seems outdated, at least for the Vulkan backend on my hardware: iq4_nl shows as the fastest 4-bit quant, whereas the table says IQ quants are supposed to be slow on the Vulkan backend.

Qwen3.5-122B-A10B also has the same issue. It seems MXFP4 is widely used in the UD quants, and it has caused the 122B Q3-level quants to perform much worse than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in a 100B-class model for a long time.

Unsloth AI org

Will post some results soon - again, apologies for the delay and sorry for the issues - I'll update everyone soon on MXFP4 vs Q4 specifically.

But overall:

  1. All our quants work fine; the ones with MXFP4 layers are just slightly less performant than the Q4 variants.
  2. UD-Q4_K_XL is the main issue - I would use MXFP4 instead, for example, which still uses some of our dynamic methodology and our calibration dataset.
  3. There is a tool-calling bug which I will also fix - this is not part of our quants, but related to a generic issue with the model.

MXFP4 layers make llama.cpp crash with the Vulkan backend on Intel Arc iGPUs and discrete GPUs.

@danielhanchen Do you plan on updating ggufs for 397B, 112B and 35B in the next few hours? I've been waiting for the dust to settle before downloading several GB's worth of data :-) Thanks for your work

> There are no bad quant types, just incorrect choices for specific tensors. Generally MXFP4 is used specifically for the routed experts ffn_(gate|down|up)_exps and not for attn.* tensors. Also, a mix labeled UD-Q4_K_XL implies there would be no MXFP4 tensors, just a mix of q4_K and similar types.
>
> Accidentally using MXFP4 in the wrong place could lead to lower quality than could otherwise be achieved.
>
> You can see some perplexity and KLD data in the reddit thread linked above which shows a measurable discrepancy. This is what alerted folks to the potential issue.

@ubergarm Why do you say that MXFP4 is not a bad quant type? I think MXFP4 being used for experts is simply because the GPT-OSS models are natively trained that way, not because the distribution of the weight values in experts specifically "conform" to MXFP4. There is a comment by ikawrakow about using MXFP4 for PTQ, and while I lack any experience with AI/ML, I agree with him that only having a single exponent bit is really suboptimal for accuracy.

Honestly, I believe the purpose of FP4 and other low-precision floats assigning most of the bits to the exponent field is just to cover the widest dynamic range possible while minimizing the need for extra scaling factors (which will complicate direct hardware implementations).
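For context on why the single mantissa bit bites: as I understand the OCP Microscaling spec, MXFP4 stores 32 E2M1 elements (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) plus one shared power-of-two scale per block. A toy round-trip sketch (illustrative only, not the llama.cpp kernel or the OCP reference rounding):

```python
import math

# Representable E2M1 magnitudes: only two mantissa levels per binade.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(xs):
    """Quantize a 32-element block: one shared power-of-two exponent,
    each element rounded to the nearest signed E2M1 value."""
    assert len(xs) == 32
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 0, [0.0] * 32
    # Shared scale: smallest power of two mapping the block max onto <= 6.0
    exp = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exp
    vals = []
    for x in xs:
        q = min(E2M1, key=lambda v: abs(v - abs(x) / scale))  # nearest code
        vals.append(math.copysign(q, x))
    return exp, vals

def dequantize_block(exp, vals):
    return [v * 2.0 ** exp for v in vals]
```

Values that happen to land on a code (e.g. 1.5, 3.0, 6.0 at scale 1) survive exactly; everything else gets snapped to one of only 15 signed levels, which is the coarseness being discussed above.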

shimmyshimmer changed discussion title from Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts? to Fixed: Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?
shimmyshimmer changed discussion title from Fixed: Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts? to Update: Should now be Fixed - Bug in UD-Q4_K_XL recipe using MXFP4 for attn tensors and experts?

TiwitMuffbiscuit here. Thanks a lot! I know it’s a lot of work, but it’s also a lot of good data to work with. You even went as far as testing against all other major quants and left everything available for us to see—that’s a first. Beautiful graphs.

I’ve downloaded everything I needed to put into the Reddit post; I’m just waiting for the god-damn computer to finish a task. Seven hours to go, then I’ll be able to do some tests. In the meantime, I’ll update the post adequately.

I have a Vega 8 with the Vulkan backend (poor GPU club). In past testing, Q8 always had better PP than Q4, and Q4 always had better TG than Q8.

The tests shared above show that all quants have almost the same TG, and PP is even marginally higher at Q4 than at Q8, which is the reverse of what I measured before.

Can other community members confirm what the case is for them?

> Qwen3.5-122B-A10B also has the same issue, it seems MXFP4 is widely used in UD quantization, and it has caused the performance of 122B Q3-level quantization to be much lower than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in 100B-level models for a long time.

I have been seeing those same issues with Q8 as well.

> Qwen3.5-122B-A10B also has the same issue, it seems MXFP4 is widely used in UD quantization, and it has caused the performance of 122B Q3-level quantization to be much lower than expected, with generation producing garbled text or repetition. I haven't seen this kind of loss in 100B-level models for a long time.

Qwen3-Coder-Next as well. Hope to see new UD GGUFs...
