
Libraries

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with llama-cpp-python:

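# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub and load it.
llm = Llama.from_pretrained(
    repo_id="stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF",
    filename="Qwen3-Coder-Next-APEX-I-Quality.gguf",
)

# Run a single chat turn and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
print(response["choices"][0]["message"]["content"])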


llama.cpp

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
# Run inference directly in the terminal:
llama-cli -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
# Run inference directly in the terminal:
llama-cli -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
# Run inference directly in the terminal:
./llama-cli -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Use Docker

docker model run hf.co/stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF


vLLM

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Ollama
How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Ollama:
```
ollama run hf.co/stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
```

Unsloth Studio

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF to start chatting

Pi

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Run Hermes

hermes

Docker Model Runner
How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Docker Model Runner:
```
docker model run hf.co/stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF
```

Lemonade

How to use stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF

Run and chat with the model

lemonade run user.Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Qwen3-Coder-Next 80B — APEX I-Quality GGUF

First APEX I-Quality quantization of Qwen3-Coder-Next 80B, calibrated on a code corpus.

This is an APEX I-Quality quantization of Qwen/Qwen3-Coder-Next — an 80B parameter Mixture-of-Experts model with only 3B active parameters per token, designed specifically for coding agents and local development.

What Makes This Different

- APEX I-Quality profile: the highest quality tier in the APEX quantization framework, using per-tensor type optimization for MoE architectures
- Code-calibrated imatrix: importance matrix generated from 50,575 code samples (not Wikipedia). The imatrix tells the quantizer which weights matter most for code generation, syntax, tool calling, and agent workloads
- Production tested: this exact model runs in production powering PicoClaw coding agents on AMD Ryzen AI Max+ 395 hardware

Files

| File | Size | Description |
| --- | --- | --- |
| `Qwen3-Coder-Next-APEX-I-Quality.gguf` | 54.1 GB | APEX I-Quality quantized model (5.43 BPW) |
| `imatrix-coder-next.dat` | 457 MB | Code-calibrated importance matrix; use this for your own quantizations |

Model Details


| Property | Value |
| --- | --- |
| Architecture | qwen3next (hybrid attention + SSM with MoE) |
| Total Parameters | 79.67B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Count | 512 experts, 10 active per token |
| Context Length | 262,144 tokens (native) |
| Original Type | BF16 (148.5 GB) |
| Quantized Size | 54.1 GB (5.43 BPW) |
| Quantization | APEX I-Quality (Q6_K/Q5_K/IQ4_XS experts, Q8_0 shared, Q6_K attention) |
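
The bits-per-weight (BPW) figure in the table can be checked from the file size and the parameter count. The sketch below is only an illustration: it assumes the 54.1 GB figure is in decimal gigabytes (10^9 bytes), and the function name `bits_per_weight` is made up for this example.

```python
def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Average number of bits stored per model parameter."""
    return file_size_bytes * 8 / n_params


N_PARAMS = 79.67e9    # total parameters, from the table above
QUANT_BYTES = 54.1e9  # quantized file size, assuming decimal GB

bpw = bits_per_weight(QUANT_BYTES, N_PARAMS)
print(f"quantized: {bpw:.2f} BPW")  # quantized: 5.43 BPW

# For comparison: the size the same parameter count would take at 16 bits each.
bf16_bytes = N_PARAMS * 16 / 8
print(f"BF16: {bf16_bytes / 1e9:.1f} GB = {bf16_bytes / 2**30:.1f} GiB")
# BF16: 159.3 GB = 148.4 GiB
```

Under these assumptions the quantized size reproduces the stated 5.43 BPW. The BF16 result in GiB is close to the 148.5 listed for the original, which suggests that figure may be reported in binary units.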

Performance

Tested on AMD Ryzen AI Max+ 395 (128GB unified memory, ROCm/Vulkan):

| Metric | Value |
| --- | --- |
| Output Speed | ~50-60 t/s |
| Prompt Processing | Fast (MoE architecture) |
| Memory Usage | ~54 GB model + KV cache |
| Parallel Sessions | 4 (with --parallel 4) |

Because only about 3B parameters are active per token, this 80B model runs at speeds comparable to, or faster than, much smaller dense models. On our hardware, it outperforms the 30B variant in both speed and quality.

How to Run

llama.cpp (recommended)

# Download the model file
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --local-dir ./models/

# Run with llama-server
./llama-server \
  -m ./models/Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 32768 --parallel 4 \
  -ngl 99 --no-mmap
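
Once the server is up, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost:8080` as in the command above; the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server was started with --port 8080 on this machine.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python function that reverses a string."))` should print the model's answer. Building the request separately from sending it keeps the payload easy to inspect without a live server.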

Ollama

Create a Modelfile:

FROM ./Qwen3-Coder-Next-APEX-I-Quality.gguf
PARAMETER num_ctx 32768

Then:

ollama create coder-next -f Modelfile
ollama run coder-next

Hardware Requirements

| Setup | RAM/VRAM | Notes |
| --- | --- | --- |
| AMD Ryzen AI Max+ 395 | 128 GB unified | Recommended. Full GPU offload, fast inference |
| Apple M4 Max/Ultra | 128 GB+ unified | Should work well with Metal |
| Dual GPU (48GB each) | 96 GB+ VRAM | Split across GPUs |
| CPU + RAM | 64 GB+ RAM | Slower, but works with mmap |

At least ~58 GB of free memory is required to hold the model plus the KV cache at a 32K context.
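
To see how much room each setup leaves, the ~58 GB figure can be compared against the memory sizes in the table. This is a rough sketch, not a sizing tool: it uses the table's total capacities, while the requirement refers to free memory, so real headroom will be smaller once the OS and other processes are counted. The name `headroom_gb` is illustrative.

```python
REQUIRED_GB = 58.0  # model + KV cache at 32K context, from the text above
MODEL_GB = 54.1     # quantized model file size

# Implied budget for KV cache and runtime buffers at 32K context.
KV_AND_BUFFERS_GB = REQUIRED_GB - MODEL_GB

# Total memory per setup, taken from the hardware table (lower bounds).
SETUPS_GB = {
    "AMD Ryzen AI Max+ 395": 128,
    "Apple M4 Max/Ultra": 128,
    "Dual GPU (48GB each)": 96,
    "CPU + RAM": 64,
}


def headroom_gb(total_gb: float, required_gb: float = REQUIRED_GB) -> float:
    """Memory left over after loading the model and a 32K-context KV cache."""
    return total_gb - required_gb


for name, total in SETUPS_GB.items():
    print(f"{name}: {headroom_gb(total):.0f} GB headroom")
```

On these numbers the 64 GB CPU setup has only about 6 GB to spare before accounting for the operating system, so it is the configuration most likely to need a smaller context size.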

Using the Imatrix

The included `imatrix-coder-next.dat` was generated from 50,575 code samples using llama-imatrix. You can use it for your own quantizations of Qwen3-Coder-Next:

# Download just the imatrix
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  imatrix-coder-next.dat \
  --local-dir ./

# Use it with llama-quantize for custom quants
./llama-quantize \
  --imatrix ./imatrix-coder-next.dat \
  Qwen3-Coder-Next-BF16.gguf \
  output.gguf Q4_K_M
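
When producing several quant types from the same imatrix, the command above can be generated in a loop. The sketch below only builds and prints the command lines; the function name `quantize_command` and the output file naming are assumptions made for this example, and the extra types (`Q5_K_M`, `IQ4_XS`) are simply other types that llama-quantize accepts.

```python
import shlex


def quantize_command(
    src_gguf: str,
    out_gguf: str,
    quant_type: str,
    imatrix: str = "./imatrix-coder-next.dat",
    binary: str = "./llama-quantize",
) -> list[str]:
    """Return the llama-quantize argument list for one target type."""
    return [binary, "--imatrix", imatrix, src_gguf, out_gguf, quant_type]


for qtype in ("Q4_K_M", "Q5_K_M", "IQ4_XS"):
    cmd = quantize_command(
        "Qwen3-Coder-Next-BF16.gguf",
        f"Qwen3-Coder-Next-{qtype}.gguf",  # hypothetical output name
        qtype,
    )
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` to execute it directly.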

About

Quantized by STACKS! Container Hosting — a cloud platform built on owned hardware. This model powers our PicoClaw AI coding agents, offering unlimited inference at flat-rate pricing.

We believe in giving back to the open source community. This quantization and the code-calibrated imatrix are provided freely under the same Apache 2.0 license as the original model.

Acknowledgments

Qwen Team for the incredible Coder-Next model
Mudler for the APEX quantization framework
eaddario for the code calibration dataset
The llama.cpp community for making local inference possible
