Compressed with llm-compressor v0.8.1.

```bash
# Create a dedicated Python env
python3 -m venv llmcompressor
source llmcompressor/bin/activate
# Install llm-compressor
pip install llmcompressor
# We need a transformers release that supports Qwen3-Next (llm-compressor v0.8.1 currently installs an older version)
pip install -U 'transformers>=4.57.0'
# Download the original model into the HF cache
hf download Qwen/Qwen3-Next-80B-A3B-Instruct
# Install the recommended (but not mandatory) flash-linear-attention
pip install flash-linear-attention
# Install the recommended (but not mandatory) causal-conv1d (v1.5.3.post2 is currently the latest stable) - requires cuda-toolkit-12-8 & python3-dev
pip install git+https://github.com/Dao-AILab/[email protected]
# Download and run the quantization script
wget https://github.com/vllm-project/llm-compressor/raw/refs/heads/main/examples/quantization_w4a4_fp4/qwen3_next_example.py
python3 qwen3_next_example.py
```
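
For reference, the downloaded example follows the standard llm-compressor one-shot NVFP4 flow. The sketch below is an approximation of that flow, not the exact contents of `qwen3_next_example.py`: the calibration dataset, sample counts, ignore list, and output directory are illustrative assumptions.

```python
# Rough sketch of an llm-compressor NVFP4 one-shot run (illustrative, not the exact
# script): load the model, build a small calibration set, apply the NVFP4 (w4a4)
# recipe, and save the result in compressed-tensors format.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-NVFP4"
NUM_CALIBRATION_SAMPLES = 512   # illustrative values; the real script may differ
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 quantizes activations as well as weights, so it needs calibration data
# to set the activation scales.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Quantize the Linear layers to NVFP4; keep lm_head (and, for MoE models such as
# Qwen3-Next, typically the router/gate modules) in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```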

As of vLLM v0.11.0 there is no NVFP4 CUDA kernel for SM 120 (Blackwell RTX Pro 6000). It should work on B200, but this has not been tested.
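
On hardware where the NVFP4 kernels are available, loading should be the usual vLLM compressed-tensors flow. The snippet below is a hypothetical usage sketch, not a tested configuration; the `tensor_parallel_size` and sampling settings are illustrative.

```python
# Hypothetical usage once vLLM ships NVFP4 kernels for the target GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ig1/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    tensor_parallel_size=2,   # adjust to your GPU count / memory (illustrative)
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to Qwen3-Next."], params)
print(outputs[0].outputs[0].text)
```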
