I've been using SGLang+KTransformers for a while. I happen to have access to a server with an EPYC 9004 CPU. Here are my takeaways:
The hardware setup is a single-socket 9V74 with 288GB RAM and a single RTX 3090. I usually run GLM-4.5-Air on it (GPU weights in bf16, CPU weights in int8), getting ~390 tokens/s prefill and ~34 tokens/s decode. For the AMD CPU to reach these speeds, you specifically need to install AMD's BLIS library. The documentation on this from both AMD and KTransformers is seriously lacking, and the CPU-side weights also require a specific quantization method.
The above speeds were achieved by running with the --kt-max-deferred-experts-per-token 7 flag. Without it, decode speed drops by about 1.4x (from ~34 to roughly 24 tokens/s), while the prefill slowdown is less dramatic.
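For reference, a launch invocation along these lines matches the setup described above. This is a sketch, not my exact command: the model path and port are placeholders, and apart from the deferred-experts flag quoted in the text, any other KTransformers-related options you need will depend on your SGLang build and quantized weights.

```shell
# Hypothetical launch sketch -- model path and port are placeholders.
# Only --kt-max-deferred-experts-per-token 7 is taken from the notes above;
# consult your SGLang/KTransformers build for the remaining CPU-offload flags.
python -m sglang.launch_server \
  --model-path /models/GLM-4.5-Air \
  --port 30000 \
  --kt-max-deferred-experts-per-token 7
```

Dropping the last flag is what costs the ~1.4x in decode speed mentioned above, so it is worth keeping in any startup script.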