MXFP4 vs Q8 vs bf16

#2 opened by InformaticsSolutions

They are all roughly the same size. I was wondering why not just go with the highest quant (best accuracy)? In my tests Q8 and bf16 are equally fast (or slow!), at ~20 tps. Thanks.

On the model card of the vanilla quant, Bartowski explained:
The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything.

The rest of these are provided for your own interest in case you feel like experimenting, but the size savings are basically non-existent, so I would not recommend running them; they are provided simply for show.
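To make the size argument concrete: MXFP4 is a block format in which groups of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The sketch below (an illustration, not the actual llama.cpp or gpt-oss code; the scale-selection rule here is a simplified assumption, where the OCP MX spec uses a floor-based rule) round-trips a block through FP4 to show what the format can and cannot represent:

```python
import numpy as np

# Representable non-negative magnitudes of an E2M1 (FP4) element.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Round-trip one 32-element block through shared-scale FP4.

    Returns the dequantized block and the power-of-two scale.
    """
    assert block.size == 32
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Simplified scale choice (assumption): smallest power of two
    # that keeps every scaled magnitude within FP4's max of 6.0.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = block / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_VALUES[idx] * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=32)
deq, scale = quantize_mxfp4_block(w)
max_err = np.abs(w - deq).max()
```

Each block costs 32 × 4 bits plus one shared scale, i.e. about 4.25 bits per weight, which is why re-quantizing the MXFP4 FFN tensors to Q8 or bf16 only grows the file without recovering any precision: the information was already discarded when the model was released in MXFP4.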
