NVFP4
Fast inference for Blackwell GPUs. A collection of 5 items.
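For context, NVFP4 stores each weight as a 4-bit FP4 (E2M1) value together with one FP8 (E4M3) scale per block of 16 elements and a single FP32 scale per tensor; dequantization is just the product of the three. The sketch below illustrates that arithmetic with made-up numbers (none of the values are taken from these checkpoints):

```python
# Toy sketch of NVFP4 block-scaled dequantization. All numbers are illustrative.
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |values| in E2M1

def dequantize_block(fp4_values, block_scale_e4m3, tensor_scale_fp32):
    """Reconstruct one 16-element block: weight = fp4_value * block_scale * tensor_scale."""
    return [v * block_scale_e4m3 * tensor_scale_fp32 for v in fp4_values]

block = [6.0, -3.0, 0.5, 0.0] * 4           # 16 decoded FP4 values (hypothetical)
print(dequantize_block(block, 0.25, 0.01))  # FP8 block scale and FP32 tensor scale (hypothetical)
```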
Compressed with llm-compressor v0.8.1; the steps below reproduce the quantization.
```bash
# Create a dedicated python env
python3 -m venv llmcompressor
source llmcompressor/bin/activate
# Install llm-compressor
pip install llmcompressor
# We need a transformers version that supports Qwen3-Next (llmcompressor v0.8.1 currently pulls in an older one)
pip install -U 'transformers>=4.57.0'
# Download the original model into the HF cache
hf download Qwen/Qwen3-Next-80B-A3B-Instruct
# Install the recommended (but not mandatory) flash-linear-attention
pip install flash-linear-attention
# Install the recommended (but not mandatory) causal-conv1d (v1.5.3.post2 is currently the latest stable release); building it requires cuda-toolkit-12-8 and python3-dev
pip install git+https://github.com/Dao-AILab/causal-conv1d@v1.5.3.post2
# Download and run the NVFP4 quantization example script
wget https://github.com/vllm-project/llm-compressor/raw/refs/heads/main/examples/quantization_w4a4_fp4/qwen3_next_example.py
python3 qwen3_next_example.py
```
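The example script follows llm-compressor's standard NVFP4 flow: quantize the Linear layers, calibrate the scales on a small dataset, and save a compressed checkpoint. A simplified sketch of that flow is below; the output directory, calibration dataset, and ignore list are placeholders, so refer to the downloaded qwen3_next_example.py for the authoritative settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
SAVE_DIR = "Qwen3-Next-80B-A3B-Instruct-NVFP4"  # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to NVFP4; the real script's ignore list (lm_head, MoE routers, ...)
# may differ from this placeholder.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Calibration pass to fit the per-block FP8 scales and per-tensor FP32 scales.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",       # placeholder calibration dataset
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```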
Note: vLLM v0.11.0 does not yet ship an NVFP4 CUDA kernel for SM 120 (Blackwell RTX Pro 6000), so these checkpoints cannot be served on that GPU for now. They should work on a B200, but this has not been tested.
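On hardware where the kernel is available (e.g. a B200), loading one of these checkpoints with vLLM's Python API would look roughly like the sketch below; the model path and tensor_parallel_size are placeholders to adjust for your setup.

```python
from vllm import LLM, SamplingParams

# Placeholder model path / repo id: point this at the NVFP4 checkpoint you want to serve.
llm = LLM(model="Qwen3-Next-80B-A3B-Instruct-NVFP4", tensor_parallel_size=2)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Give a one-sentence summary of NVFP4 quantization."], sampling)
print(outputs[0].outputs[0].text)
```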
Base model: Qwen/Qwen3-Next-80B-A3B-Instruct