Qwen3-Next-80B-A3B-Instruct — MLX 5-bit (group size 32)

Summary. This is a 5-bit (Q5) MLX quantization of Qwen3-Next-80B-A3B-Instruct with group size 32. Built for Apple Silicon with Metal acceleration.

  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct (apache-2.0)
  • Quantization: MLX Q5, q_group_size=32 (some tensors may remain 16-bit for stability)
  • Files: MLX weight shards + config.json; tokenizer files included for drop-in use
  • Intended use: local inference / research on M-series Macs
  • Not intended for: safety-critical decisions; outputs may be inaccurate or biased

Requirements

  • Hardware: Apple Silicon (M-series) with Metal acceleration.
  • Memory: ≥96 GB of unified memory recommended for comfortable headroom at large context lengths.
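
As a rough sanity check on that figure, here is a back-of-envelope estimate of the quantized weight footprint (an illustration only; it assumes MLX's affine quantization, which stores an fp16 scale and an fp16 bias per 32-weight group):

params = 80e9                          # total parameter count
overhead = (16 + 16) / 32              # metadata bits per weight (fp16 scale + bias per group)
bits_per_weight = 5 + overhead         # ~6 effective bits per weight
print(f"~{params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~60 GB of weights

The remaining headroom goes to the KV cache and activations, which grow with context length.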

How to use (MLX)

Install:

pip install mlx-lm

Python API:

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))

Command line:

python -m mlx_lm generate --model halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
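
For chat-style prompts, you can apply the tokenizer's chat template before generating (a minimal sketch; it assumes the bundled tokenizer ships the base model's chat template, as Hugging Face tokenizers for instruct models normally do):

messages = [{"role": "user", "content": "Explain the Chudnovsky algorithm to compute pi."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))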

Evaluation

Streaming perplexity (PPL) evaluation on WikiText-2 (raw, test split); fast preset with window = stride = 4096, a ~100k-token budget, and an EOS token inserted between documents.

Variant                 PPL (ctx=4096, fast)
MLX bf16 (reference)    5.14
MLX 6-bit (gs=64)       5.14 (≈0.0% vs bf16)
MLX 5-bit (gs=32)       5.20 (+1.2% vs bf16, +1.2% vs 6-bit/gs64)
MLX 4-bit (gs=64)       5.43 (+5.6% vs bf16, +5.6% vs 6-bit/gs64)

Notes:

  • Numbers from local MLX runs on Apple Silicon; small variations are expected with tokenizer details, logits dtype, and token subset.
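
For orientation, the evaluation reduces to the following loop (a simplified sketch, not the exact script referenced below; it assumes non-overlapping windows since window == stride, uses mlx.nn's cross-entropy loss, and omits the EOS insertion between documents):

import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-5bit-gs32")

def windowed_ppl(text: str, window: int = 4096, budget: int = 100_000) -> float:
    tokens = tokenizer.encode(text)[:budget]
    nll_sum, count = 0.0, 0
    for start in range(0, len(tokens) - 1, window):       # window == stride
        chunk = mx.array([tokens[start : start + window + 1]])
        logits = model(chunk[:, :-1]).astype(mx.float32)  # next-token logits
        nll = nn.losses.cross_entropy(logits, chunk[:, 1:], reduction="sum")
        nll_sum += nll.item()
        count += chunk.shape[1] - 1
    return math.exp(nll_sum / count)                      # PPL = exp(mean NLL)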

Interpretation

  • 6-bit gs64 matches bf16 on this corpus; use it when maximum quality is the goal.
  • 5-bit gs32 is a balanced pick: near-par PPL with a smaller footprint, and it holds up on deterministic (greedy-decoded) math prompts.
  • 4-bit gs64 trades a modest quality drop for the smallest size; good for constrained machines.

Reproduce locally:

python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/Qwen3-Next-80B-A3B-Instruct-5bit-gs32" \
  --fast --progress

Conversion details (provenance)

python -m mlx_lm convert \
  --hf-path Qwen3-Next-80B-A3B-Instruct \
  --mlx-path /path/to/Qwen3-Next-80B-A3B-Instruct-5bit-gs32 \
  -q --q-bits 5 --q-group-size 32

  • Some tensors (for example, embeddings, norms, and the MoE router) may remain 16-bit for numerical stability.
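
The same conversion can be run from Python (a sketch using mlx_lm's convert() API; the Hub path is substituted here for the local directory used above):

from mlx_lm import convert

convert(
    hf_path="Qwen/Qwen3-Next-80B-A3B-Instruct",
    mlx_path="/path/to/Qwen3-Next-80B-A3B-Instruct-5bit-gs32",
    quantize=True,
    q_bits=5,
    q_group_size=32,
)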

Sibling & reference models

  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-6bit-gs64
  • halley-ai/Qwen3-Next-80B-A3B-Instruct-MLX-4bit-gs64

Limitations and biases

Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review.

License and credits

  • License: apache-2.0 (inherits from the base model)
  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
  • Quantization: Halley AI Lab (MLX Q5, gs=32)
  • Please cite both the base model and this repository when you use the weights.