I've been using SGLang+KTransformers for a while. I happen to have access to a server with an EPYC 9004 CPU. Here are my takeaways:
The hardware setup is a single-socket 9V74 with 288GB RAM and a single RTX 3090. I usually run GLM-4.5-Air on it (GPU weights in bf16, CPU weights in int8), getting ~390 tokens/s prefill and ~34 tokens/s decode. For the AMD CPU to reach these speeds, you specifically need to install AMD's BLIS library. The documentation on this from both AMD and KTransformers is seriously lacking, and the CPU-side weights also require a specific quantization method.
The above speeds were achieved by running with the --kt-max-deferred-experts-per-token 7 flag. Without it, decode speed drops by about 1.4x (from ~34 to roughly 24 tokens/s), while the prefill slowdown is less dramatic.
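For reference, a launch invocation along these lines matches the setup described above. This is a sketch, not my exact command: the model path and port are placeholders, and apart from the deferred-experts flag quoted in the text, any other KTransformers-related options you need will depend on your SGLang build and quantized weights.

```shell
# Hypothetical launch sketch -- model path and port are placeholders.
# Only --kt-max-deferred-experts-per-token 7 is taken from the notes above;
# consult your SGLang/KTransformers build for the remaining CPU-offload flags.
python -m sglang.launch_server \
  --model-path /models/GLM-4.5-Air \
  --port 30000 \
  --kt-max-deferred-experts-per-token 7
```

Dropping the last flag is what costs the ~1.4x in decode speed mentioned above, so it is worth keeping in any startup script.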