Qwen3-Coder-REAP-25B-A3B AWQ 4-bit

AWQ 4-bit quantization of Cerebras Qwen3-Coder-REAP-25B-A3B, a REAP-pruned (arXiv:2510.13999) variant of Qwen3-Coder-30B-A3B-Instruct, calibrated on a thinking + code mix and optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

Base model: cerebras/Qwen3-Coder-REAP-25B-A3B (REAP prune of Qwen3-Coder-30B-A3B-Instruct)
Architecture: Qwen3 MoE (96 experts post-REAP, top-8 routing)
Parameters: ~25B total / ~3B active
Pruning method: REAP (router-aware expert pruning, 25% of experts dropped); distinct from REAM (expert merging)
Layers: 48
Context: 131K tested, 256K supported by the base model
Quantization: native AWQ 4-bit, group_size=128, fused Triton GEMM
Calibration: GPTQ via llmcompressor, 256 samples × 1024 tokens, code_thinking mix (AM-Thinking-v1, NuminaMath-CoT, ultrachat); ignore=lm_head, mlp.gate, shared_expert.* (sketched below)
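
A minimal sketch of that calibration step, assuming llmcompressor's one-shot GPTQ flow with a W4A16 (4-bit, group_size=128) scheme. The actual code_thinking dataset mix is replaced by a registered stand-in dataset here, and the exact recipe for this checkpoint may differ:

```python
# Sketch only: the real run used a code_thinking mix (AM-Thinking-v1, NuminaMath-CoT,
# ultrachat); "open_platypus" below is just a registered stand-in dataset.
# Import path varies slightly across llmcompressor versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",                 # 4-bit weights / 16-bit activations, group_size=128
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",         # router stays BF16
        "re:.*shared_expert.*",     # shared expert path stays BF16
    ],
)

oneshot(
    model="cerebras/Qwen3-Coder-REAP-25B-A3B",
    dataset="open_platypus",        # stand-in; see the mix listed above
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=256,
    output_dir="Qwen3-Coder-REAP-25B-A3B-W4A16-CT",
)
```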

Performance (2× AMD Radeon AI PRO R9700, TP=2, FP8 KV)

sglang.bench_serving, single user, FP8 KV cache, --disable-cuda-graph:

| Context | TPOT (ms) | tok/s |
|--------:|----------:|------:|
| 128     | 43.6      | 22.9  |
| 1024    | 43.7      | 22.9  |
| 8192    | 44.1      | 22.7  |
| 32768   | 44.2      | 22.6  |
| 65536   | 45.5      | 22.0  |
| 131072  | 45.6      | 21.9  |

Decode stays flat at ~22.5 tok/s across the full 131K range: the A3B MoE remains bandwidth-bound, with no attention scaling cliff.

Notes

This is REAP, not REAM. REAP prunes experts based on router-aware impact scores; REAM (Samsung SAIL) instead merges similar experts. Both shrink MoE models, but with different algorithms and tradeoffs, so they are not interchangeable. The base Cerebras prune drops 32 of 128 experts (a 25% reduction).
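
For intuition, here is a minimal, hypothetical sketch of a router-aware saliency score of the kind REAP uses: each expert is scored by its router weight times the norm of its output, averaged over calibration tokens, and the lowest-scoring 25% are dropped. Names and shapes are illustrative, not the actual Cerebras pipeline:

```python
import torch

def expert_saliency(router_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """router_probs / expert_out_norms: [num_tokens, num_experts]; returns [num_experts]."""
    # Router-weighted activation norm, averaged over calibration tokens.
    return (router_probs * expert_out_norms).mean(dim=0)

def experts_to_keep(saliency: torch.Tensor, drop_fraction: float = 0.25) -> torch.Tensor:
    # 128 experts with drop_fraction=0.25 -> keep the 96 highest-scoring experts.
    num_keep = int(round(saliency.numel() * (1.0 - drop_fraction)))
    return torch.topk(saliency, k=num_keep).indices.sort().values
```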

The CT (compressed-tensors) output from llmcompressor was converted to native AWQ via convert_moe_ct_to_awq.py; on ROCm, the AWQ Triton GEMM kernel is about 6× faster than the compressed-tensors path on the same weights.

shared_expert.{gate,up,down}_proj and mlp.gate (the router) are preserved in BF16 so the always-on residual and routing paths never pass through INT4. shared_expert_gate (output dim 1) automatically falls back to BF16 in the converter, since AWQ packing requires the packed dimension to be divisible by 8.
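
The divisibility constraint comes from how AWQ stores weights: eight 4-bit values are packed into each 32-bit word along one dimension. A toy illustration (the real converter also handles AWQ's interleaved nibble order, which is omitted here):

```python
import numpy as np

def pack_int4_to_int32(q: np.ndarray) -> np.ndarray:
    """Pack uint4 values (0..15) along the last dim into uint32 words, 8 per word."""
    assert q.shape[-1] % 8 == 0, "AWQ packing needs the packed dim divisible by 8"
    q = q.reshape(*q.shape[:-1], -1, 8).astype(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4      # one nibble per value
    return (q << shifts).sum(axis=-1).astype(np.uint32)

pack_int4_to_int32(np.random.randint(0, 16, size=(4, 16)))    # fine: 16 % 8 == 0
# pack_int4_to_int32(np.random.randint(0, 16, size=(4, 1)))   # out dim 1 -> cannot pack, keep BF16
```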

Usage with SGLang

Tested on the RDNA4 inference stack (SGLang v0.5.10 + 16 RDNA4 patches):

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ scripts/launch.sh coder-reap-25b

The coder-reap-25b preset auto-detects the AWQ format and uses --quantization moe_wna16 with an FP8 KV cache for single-user serving at 131K context.
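
For reference, a rough Python equivalent of what that preset configures, using SGLang's offline Engine (which accepts the same options as sglang.launch_server). The exact flags here are assumptions; the authoritative settings live in the repo's launch script:

```python
import sglang as sgl

engine = sgl.Engine(
    model_path="mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ",
    tp_size=2,                   # 2x R9700
    quantization="moe_wna16",    # fused WNA16 MoE path for the AWQ weights
    kv_cache_dtype="fp8_e5m2",   # FP8 KV cache (e4m3 is the other FP8 option)
    context_length=131072,
    disable_cuda_graph=True,
)
print(engine.generate("Write a Python function that reverses a linked list.")["text"])
```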

For other inference engines, this is a standard AWQ 4-bit checkpoint (group_size=128, asymmetric, fused MoE) and should load via vllm / transformers + autoawq without modification.
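
As an example, loading with vLLM might look like the following (a sketch: vLLM normally auto-detects AWQ from the checkpoint config, so the explicit quantization argument is optional):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ",
    quantization="awq",          # optional; auto-detected from the config
    tensor_parallel_size=2,
    max_model_len=131072,
    kv_cache_dtype="fp8",
)
outputs = llm.generate(["Explain what an MoE router does."], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```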

Hardware

Calibrated and benchmarked on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches. Per-GPU budget: ~6 GB weights + ~4 GB FP8 KV cache at 131K context, plus overhead.
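
A back-of-envelope check of those per-GPU numbers, assuming the attention geometry of the Qwen3-Coder-30B-A3B base (48 layers, 4 KV heads, head_dim 128, taken from the base config rather than measured here):

```python
GB = 1024**3

# Weights: ~25B params at ~0.5 bytes/param (INT4), split across TP=2.
weights_per_gpu = 25e9 * 0.5 / 2 / GB
print(f"weights per GPU ~{weights_per_gpu:.1f} GB")       # ~5.8 GB

# FP8 KV cache at 131K tokens: K and V, 48 layers, 4 KV heads, head_dim 128,
# 1 byte/element, with KV heads split across the two GPUs under tensor parallelism.
kv_per_token = 2 * 48 * 4 * 128 * 1
kv_per_gpu = kv_per_token * 131072 / 2 / GB
print(f"fp8 KV per GPU at 131K ~{kv_per_gpu:.1f} GB")     # ~3.0 GB
```

That lands close to the ~6 GB weights and ~4 GB KV quoted above once the BF16-kept tensors and allocator overhead are added in.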
