Qwen3-Coder-REAP-25B-A3B AWQ 4-bit

AWQ 4-bit quantization of Cerebras Qwen3-Coder-REAP-25B-A3B, a REAP-pruned (arXiv:2510.13999) variant of Qwen3-Coder-30B-A3B-Instruct, calibrated on a thinking + code mix and optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

Base model: cerebras/Qwen3-Coder-REAP-25B-A3B (REAP prune of Qwen3-Coder-30B-A3B-Instruct)
Architecture: Qwen3 MoE (96 experts post-REAP, top-8 routing)
Parameters: ~25B total / ~3B active
Pruning method: REAP (router-aware expert pruning, 25% of experts dropped); distinct from REAM (expert merging)
Layers: 48
Context: 131K tested, 256K supported by the base model
Quantization: native AWQ 4-bit, group_size=128, fused Triton GEMM
Calibration: GPTQ via llmcompressor, 256 samples × 1024 tokens, code_thinking mix (AM-Thinking-v1, NuminaMath-CoT, ultrachat); ignore=lm_head, mlp.gate, shared_expert.* (sketched below)
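
A minimal sketch of that calibration step, assuming llmcompressor's one-shot GPTQ flow with a W4A16 (4-bit, group_size=128) scheme. The actual code_thinking dataset mix is replaced by a registered stand-in dataset here, and the exact recipe for this checkpoint may differ:

```python
# Sketch only: the real run used a code_thinking mix (AM-Thinking-v1, NuminaMath-CoT,
# ultrachat); "open_platypus" below is just a registered stand-in dataset.
# Import path varies slightly across llmcompressor versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",                 # 4-bit weights / 16-bit activations, group_size=128
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",         # router stays BF16
        "re:.*shared_expert.*",     # shared expert path stays BF16
    ],
)

oneshot(
    model="cerebras/Qwen3-Coder-REAP-25B-A3B",
    dataset="open_platypus",        # stand-in; see the mix listed above
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=256,
    output_dir="Qwen3-Coder-REAP-25B-A3B-W4A16-CT",
)
```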

Performance (2× AMD Radeon AI PRO R9700, TP=2, FP8 KV)

sglang.bench_serving, single user, FP8 KV cache, --disable-cuda-graph:

| Context | TPOT (ms) | tok/s |
|--------:|----------:|------:|
| 128     | 43.6      | 22.9  |
| 1024    | 43.7      | 22.9  |
| 8192    | 44.1      | 22.7  |
| 32768   | 44.2      | 22.6  |
| 65536   | 45.5      | 22.0  |
| 131072  | 45.6      | 21.9  |

Decode stays flat at ~22.5 tok/s across the full 131K range: the A3B MoE remains bandwidth-bound, with no attention scaling cliff.

Notes

This is REAP, not REAM. REAP prunes experts based on router-aware impact scores; REAM (Samsung SAIL) instead merges similar experts. Both shrink MoE models, but with different algorithms and tradeoffs, so they are not interchangeable. The base Cerebras prune drops 32 of 128 experts (a 25% reduction).
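
For intuition, here is a minimal, hypothetical sketch of a router-aware saliency score of the kind REAP uses: each expert is scored by its router weight times the norm of its output, averaged over calibration tokens, and the lowest-scoring 25% are dropped. Names and shapes are illustrative, not the actual Cerebras pipeline:

```python
import torch

def expert_saliency(router_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """router_probs / expert_out_norms: [num_tokens, num_experts]; returns [num_experts]."""
    # Router-weighted activation norm, averaged over calibration tokens.
    return (router_probs * expert_out_norms).mean(dim=0)

def experts_to_keep(saliency: torch.Tensor, drop_fraction: float = 0.25) -> torch.Tensor:
    # 128 experts with drop_fraction=0.25 -> keep the 96 highest-scoring experts.
    num_keep = int(round(saliency.numel() * (1.0 - drop_fraction)))
    return torch.topk(saliency, k=num_keep).indices.sort().values
```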

The CT (compressed-tensors) output from llmcompressor was converted to native AWQ via convert_moe_ct_to_awq.py; on ROCm, the AWQ Triton GEMM kernel is about 6× faster than the compressed-tensors path on the same weights.

shared_expert.{gate,up,down}_proj and mlp.gate (the router) are preserved in BF16 so the always-on residual and routing paths never pass through INT4. shared_expert_gate (output dim 1) automatically falls back to BF16 in the converter, since AWQ packing requires the packed dimension to be divisible by 8.
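
The divisibility constraint comes from how AWQ stores weights: eight 4-bit values are packed into each 32-bit word along one dimension. A toy illustration (the real converter also handles AWQ's interleaved nibble order, which is omitted here):

```python
import numpy as np

def pack_int4_to_int32(q: np.ndarray) -> np.ndarray:
    """Pack uint4 values (0..15) along the last dim into uint32 words, 8 per word."""
    assert q.shape[-1] % 8 == 0, "AWQ packing needs the packed dim divisible by 8"
    q = q.reshape(*q.shape[:-1], -1, 8).astype(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4      # one nibble per value
    return (q << shifts).sum(axis=-1).astype(np.uint32)

pack_int4_to_int32(np.random.randint(0, 16, size=(4, 16)))    # fine: 16 % 8 == 0
# pack_int4_to_int32(np.random.randint(0, 16, size=(4, 1)))   # out dim 1 -> cannot pack, keep BF16
```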

Usage with SGLang

Tested on the RDNA4 inference stack (SGLang v0.5.10 + 16 RDNA4 patches):

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ scripts/launch.sh coder-reap-25b

The coder-reap-25b preset auto-detects the AWQ format and uses --quantization moe_wna16 with an FP8 KV cache for single-user serving at 131K context.
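
For reference, a rough Python equivalent of what that preset configures, using SGLang's offline Engine (which accepts the same options as sglang.launch_server). The exact flags here are assumptions; the authoritative settings live in the repo's launch script:

```python
import sglang as sgl

engine = sgl.Engine(
    model_path="mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ",
    tp_size=2,                   # 2x R9700
    quantization="moe_wna16",    # fused WNA16 MoE path for the AWQ weights
    kv_cache_dtype="fp8_e5m2",   # FP8 KV cache (e4m3 is the other FP8 option)
    context_length=131072,
    disable_cuda_graph=True,
)
print(engine.generate("Write a Python function that reverses a linked list.")["text"])
```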

For other inference engines, this is a standard AWQ 4-bit checkpoint (group_size=128, asymmetric, fused MoE) and should load via vllm / transformers + autoawq without modification.
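
As an example, loading with vLLM might look like the following (a sketch: vLLM normally auto-detects AWQ from the checkpoint config, so the explicit quantization argument is optional):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ",
    quantization="awq",          # optional; auto-detected from the config
    tensor_parallel_size=2,
    max_model_len=131072,
    kv_cache_dtype="fp8",
)
outputs = llm.generate(["Explain what an MoE router does."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```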

Hardware

Calibrated and benchmarked on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches. Per-GPU budget: ~6 GB weights + ~4 GB FP8 KV cache at 131K context, plus overhead.
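
A back-of-envelope check of those per-GPU numbers, assuming the attention geometry of the Qwen3-Coder-30B-A3B base (48 layers, 4 KV heads, head_dim 128, taken from the base config rather than measured here):

```python
GB = 1024**3

# Weights: ~25B params at ~0.5 bytes/param (INT4), split across TP=2.
weights_per_gpu = 25e9 * 0.5 / 2 / GB
print(f"weights per GPU ~{weights_per_gpu:.1f} GB")       # ~5.8 GB

# FP8 KV cache at 131K tokens: K and V, 48 layers, 4 KV heads, head_dim 128,
# 1 byte/element, with KV heads split across the two GPUs under tensor parallelism.
kv_per_token = 2 * 48 * 4 * 128 * 1
kv_per_gpu = kv_per_token * 131072 / 2 / GB
print(f"fp8 KV per GPU at 131K ~{kv_per_gpu:.1f} GB")     # ~3.0 GB
```

That lands close to the ~6 GB weights and ~4 GB KV quoted above once the BF16-kept tensors and allocator overhead are added in.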
