MXFP4 vs Q8 vs bf16

#2 opened by InformaticsSolutions

They are all roughly the same size. I was wondering why not just go with the highest quant (best accuracy)? In my tests Q8 and bf16 are equally fast (or slow!), at ~20 tps. Thanks.

On the model card of the vanilla quant, Bartowski explained:
The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything.

The rest of these are provided for your own interest in case you feel like experimenting, but the size savings are basically non-existent, so I would not recommend running them; they are provided simply for show.
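To make the size argument concrete: MXFP4 is a block format in which groups of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The sketch below (an illustration, not the actual llama.cpp or gpt-oss code; the scale-selection rule here is a simplified assumption, where the OCP MX spec uses a floor-based rule) round-trips a block through FP4 to show what the format can and cannot represent:

```python
import numpy as np

# Representable non-negative magnitudes of an E2M1 (FP4) element.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Round-trip one 32-element block through shared-scale FP4.

    Returns the dequantized block and the power-of-two scale.
    """
    assert block.size == 32
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Simplified scale choice (assumption): smallest power of two
    # that keeps every scaled magnitude within FP4's max of 6.0.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = block / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_VALUES[idx] * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=32)
deq, scale = quantize_mxfp4_block(w)
max_err = np.abs(w - deq).max()
```

Each block costs 32 × 4 bits plus one shared scale, i.e. about 4.25 bits per weight, which is why re-quantizing the MXFP4 FFN tensors to Q8 or bf16 only grows the file without recovering any precision: the information was already discarded when the model was released in MXFP4.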
