Qwen3-Coder-Next 80B: APEX I-Quality GGUF
First APEX I-Quality quantization of Qwen3-Coder-Next 80B, calibrated on a code corpus.
This is an APEX I-Quality quantization of Qwen/Qwen3-Coder-Next, an 80B-parameter Mixture-of-Experts model with only 3B active parameters per token, designed specifically for coding agents and local development.
What Makes This Different
- APEX I-Quality profile: the highest quality tier in the APEX quantization framework, using per-tensor type optimization for MoE architectures
- Code-calibrated imatrix: importance matrix generated from 50,575 code samples (not Wikipedia). The imatrix tells the quantizer which weights matter most for code generation, syntax, tool calling, and agent workloads
- Production tested: this exact model runs in production powering PicoClaw coding agents on AMD Ryzen AI Max+ 395 hardware
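To make the imatrix idea concrete, here is a toy sketch of importance-weighted quantization. It is not the algorithm used by llama.cpp or APEX; the weights, the importance values, and the helper names (`quantize`, `weighted_error`, `best_scale`) are all hypothetical and chosen only for illustration. It picks a quantization scale by minimizing squared error, once with uniform weighting and once with per-weight importance, and compares the importance-weighted error of the two choices.

```python
def quantize(weights, scale, qmax=7):
    """Round each weight to the nearest multiple of `scale`, clipped to +/- qmax steps."""
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, importance, scale, qmax=7):
    """Importance-weighted squared error between the original and dequantized weights."""
    dequantized = quantize(weights, scale, qmax)
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, qmax=7, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / qmax * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s, qmax))


# Hypothetical values: one large but unimportant outlier, several small important weights.
weights = [0.02, -0.31, 0.77, 0.05, -0.12, 1.90]
uniform = [1.0] * len(weights)
importance = [5.0, 1.0, 8.0, 6.0, 1.0, 0.1]

scale_plain = best_scale(weights, uniform)        # no importance information
scale_imatrix = best_scale(weights, importance)   # importance-aware choice

err_plain = weighted_error(weights, importance, scale_plain)
err_imatrix = weighted_error(weights, importance, scale_imatrix)
print(f"weighted error without importance: {err_plain:.5f}")
print(f"weighted error with importance:    {err_imatrix:.5f}")
```

Because both searches run over the same candidate scales, the importance-aware choice can never score worse on the importance-weighted error. Loosely, that is the sense in which calibration data from the target domain steers precision toward the weights that matter for that domain.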
Files
| File | Size | Description |
|---|---|---|
| Qwen3-Coder-Next-APEX-I-Quality.gguf | 54.1 GB | APEX I-Quality quantized model (5.43 BPW) |
| imatrix-coder-next.dat | 457 MB | Code-calibrated importance matrix; use this for your own quantizations |
Model Details
| Property | Value |
|---|---|
| Architecture | qwen3next (hybrid attention + SSM with MoE) |
| Total Parameters | 79.67B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Count | 512 experts, 10 active per token |
| Context Length | 262,144 tokens (native) |
| Original Type | BF16 (148.5 GB) |
| Quantized Size | 54.1 GB (5.43 BPW) |
| Quantization | APEX I-Quality (Q6_K/Q5_K/IQ4_XS experts, Q8_0 shared, Q6_K attention) |
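The bits-per-weight figure follows from the file size and the total parameter count. The sketch below reproduces it, assuming the 54.1 GB figure is in decimal gigabytes (10^9 bytes); the function name is ours, not part of any tool.

```python
def bits_per_weight(file_size_bytes: float, parameter_count: float) -> float:
    """Average number of bits stored per model parameter."""
    return file_size_bytes * 8 / parameter_count


# Values taken from the table above; decimal GB is an assumption.
quantized_bpw = bits_per_weight(54.1e9, 79.67e9)
print(f"{quantized_bpw:.2f} BPW")
```

The result rounds to the 5.43 BPW listed in the table.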
Performance
Tested on AMD Ryzen AI Max+ 395 (128GB unified memory, ROCm/Vulkan):
| Metric | Value |
|---|---|
| Output Speed | ~50-60 t/s |
| Prompt Processing | Fast (MoE architecture) |
| Memory Usage | ~54 GB model + KV cache |
| Parallel Sessions | 4 (with --parallel 4) |
The 3B active parameter design means this 80B model runs at speeds comparable to, or faster than, much smaller dense models. On our hardware, it outperforms the 30B variant in both speed and quality.
How to Run
llama.cpp (recommended)
```bash
# Clone or download the model
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --local-dir ./models/

# Run with llama-server
./llama-server \
  -m ./models/Qwen3-Coder-Next-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 32768 --parallel 4 \
  -ngl 99 --no-mmap
```
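Once the server is up, it can be queried through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8080` (matching the `--port 8080` flag above); the helper names and the `max_tokens` value are illustrative choices, not part of this model card.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative limit, adjust as needed
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8080"):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running llama-server):
# print(chat("Write a Python function that reverses a string."))
```

Any OpenAI-compatible client library should work the same way when pointed at the `/v1` base URL.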
Ollama
Create a Modelfile:

```
FROM ./Qwen3-Coder-Next-APEX-I-Quality.gguf
PARAMETER num_ctx 32768
```

Then:

```bash
ollama create coder-next -f Modelfile
ollama run coder-next
```
Hardware Requirements
| Setup | RAM/VRAM | Notes |
|---|---|---|
| AMD Ryzen AI Max+ 395 | 128 GB unified | Recommended. Full GPU offload, fast inference |
| Apple M4 Max/Ultra | 128 GB+ unified | Should work well with Metal |
| Dual GPU (48GB each) | 96 GB+ VRAM | Split across GPUs |
| CPU + RAM | 64 GB+ RAM | Slower, but works with mmap |
Minimum ~58 GB free memory for model + KV cache at 32K context.
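As a rough sanity check, the figures above imply about 4 GB of headroom beyond the model file at 32K context. The sketch below works through that arithmetic and applies the 58 GB threshold to a few memory sizes. Note that the table lists total memory, while the threshold refers to free memory, so treat the results as optimistic; the function names are hypothetical.

```python
MODEL_GB = 54.1      # quantized model size from this card
MIN_FREE_GB = 58.0   # stated minimum for model + KV cache at 32K context


def kv_headroom_gb(min_free_gb=MIN_FREE_GB, model_gb=MODEL_GB):
    """Memory left for KV cache and runtime overhead once the model is loaded."""
    return min_free_gb - model_gb


def fits(free_memory_gb, min_free_gb=MIN_FREE_GB):
    """True if the given amount of free memory meets the stated minimum."""
    return free_memory_gb >= min_free_gb


print(f"Implied headroom at 32K context: ~{kv_headroom_gb():.1f} GB")
for name, memory_gb in [("128 GB unified", 128), ("2 x 48 GB VRAM", 96),
                        ("64 GB RAM", 64), ("single 48 GB GPU", 48)]:
    print(f"{name}: {'fits' if fits(memory_gb) else 'does not fit'}")
```

Larger context sizes or more parallel sessions will need more than this headroom, so the threshold should be read as a floor for the 32K configuration only.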
Using the Imatrix
The included imatrix-coder-next.dat was generated from 50K+ code samples using llama-imatrix. You can use it for your own quantizations of Qwen3-Coder-Next:
```bash
# Download just the imatrix
huggingface-cli download stacksnathan/Qwen3-Coder-Next-80B-APEX-I-Quality-GGUF \
  imatrix-coder-next.dat \
  --local-dir ./

# Use it with llama-quantize for custom quants
./llama-quantize \
  --imatrix ./imatrix-coder-next.dat \
  Qwen3-Coder-Next-BF16.gguf \
  output.gguf Q4_K_M
```
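To produce several quants from the same imatrix, the command above can be generated in a loop. This sketch only builds and prints the commands. The output file names and the extra quantization types (`Q5_K_M`, `Q6_K`) are illustrative assumptions, and `quantize_command` is a hypothetical helper.

```python
import shlex


def quantize_command(source, output, quant_type,
                     imatrix="./imatrix-coder-next.dat",
                     binary="./llama-quantize"):
    """Return the llama-quantize argument list for one imatrix-guided quantization."""
    return [binary, "--imatrix", imatrix, source, output, quant_type]


for quant_type in ("Q4_K_M", "Q5_K_M", "Q6_K"):
    command = quantize_command(
        "Qwen3-Coder-Next-BF16.gguf",
        f"Qwen3-Coder-Next-{quant_type}.gguf",  # hypothetical output name
        quant_type,
    )
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems if the paths contain spaces.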
About
Quantized by STACKS! Container Hosting, a cloud platform built on owned hardware. This model powers our PicoClaw AI coding agents, offering unlimited inference at flat-rate pricing.
We believe in giving back to the open source community. This quantization and the code-calibrated imatrix are provided freely under the same Apache 2.0 license as the original model.
Acknowledgments