Qwen3-Coder-30B-A3B-Instruct-NVFP4
NVFP4 quantization using llm-compressor v0.8.1, based on the official NVFP4 example script for Qwen3-30B-A3B.
Dataset adjustments
Because this is the Coder variant, the dataset used for the quantisation calibration has been adjusted as follows:
- Instead of the original 20 entries from the regular ultrachat dataset used in the Qwen3-30B-A3B example, we used 512 samples:
  - 336 (~66%) come from the codesearchnet dataset for code-related calibration:
    - 56 randomly selected python entries
    - 56 randomly selected javascript entries
    - 56 randomly selected java entries
    - 56 randomly selected go entries
    - 56 randomly selected php entries
    - 56 randomly selected ruby entries
  - 176 come from the ultrachat dataset for instruction-related calibration
- `MAX_SEQUENCE_LENGTH` has been increased from `2048` to `4096`.
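For reference, here is a rough sketch of what this calibration mix and the NVFP4 oneshot call could look like with llm-compressor. This is not the exact script used for this checkpoint: the dataset IDs, column names (`func_code_string`, `messages`) and the recipe's `ignore` list below are assumptions based on the official example, so adjust them to your setup.

```python
# Hedged sketch of the calibration mix + NVFP4 oneshot call.
# Dataset IDs, column names and the ignore list are assumptions, not the exact script.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
MAX_SEQUENCE_LENGTH = 4096      # raised from the 2048 of the original example
SAMPLES_PER_LANGUAGE = 56       # 6 languages x 56 = 336 code samples (~66%)
ULTRACHAT_SAMPLES = 176         # 336 + 176 = 512 calibration samples

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Code portion: 56 random functions per language from a codesearchnet mirror
# (dataset ID / loading details are assumptions, swap in the source you actually use).
code_parts = []
for lang in ["python", "javascript", "java", "go", "php", "ruby"]:
    ds = load_dataset("code_search_net", lang, split="train", trust_remote_code=True)
    ds = ds.shuffle(seed=42).select(range(SAMPLES_PER_LANGUAGE))
    # Wrap each function in a single-turn chat so both portions share one format.
    ds = ds.map(
        lambda x: {"messages": [{"role": "user", "content": x["func_code_string"]}]},
        remove_columns=ds.column_names,
    )
    code_parts.append(ds)

# Instruction portion: 176 ultrachat conversations.
chat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
chat = chat.shuffle(seed=42).select(range(ULTRACHAT_SAMPLES))
chat = chat.map(lambda x: {"messages": x["messages"]}, remove_columns=chat.column_names)

calibration = concatenate_datasets(code_parts + [chat]).shuffle(seed=42)

# Render with the chat template and tokenize, as in the official example.
calibration = calibration.map(
    lambda x: {"text": tokenizer.apply_chat_template(x["messages"], tokenize=False)},
    remove_columns=calibration.column_names,
)
calibration = calibration.map(
    lambda x: tokenizer(x["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True,
                        padding=False, add_special_tokens=False),
    remove_columns=calibration.column_names,
)

# NVFP4 on the Linear layers; lm_head and the MoE router gates stay unquantized
# (assumed ignore list, following the official MoE example).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(
    model=model,
    dataset=calibration,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen3-Coder-30B-A3B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-Coder-30B-A3B-Instruct-NVFP4")
```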
vLLM execution
Because this is an NVFP4 MoE model, you might have some trouble running it with the current vLLM version (v0.11.0), as no kernel is available. To launch it you will need to compile the CUTLASS FP4 GEMM (MoE) kernel for SM100 (RTX Pro 6000) or SM120 (RTX 5090). vLLM can do it automatically for you with the following configuration:
```
docker run -ti --name Qwen3-Coder-30B-A3B-NVFP4 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-Coder-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-Coder-30B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
The important part here is the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable, which instructs vLLM to compile the FP4 MoE kernel for your GPU architecture. Note that the more CPU cores you have, the more RAM you will need for the CUDA compilation.
For now you need the `vllm/vllm-openai:nightly` image (currently targeting 0.11.1rc4.dev6+g66a168a19), but once v0.11.1 is out that should no longer be necessary.
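Once the server is up, you can sanity-check the endpoint with the standard OpenAI-compatible client; the model name matches the `--served-model-name` used above:

```python
# Quick sanity check against the OpenAI-compatible endpoint exposed on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```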
A note for 5090 owners
While it is possible for you to run the model, there is a high chance that you:
- are running Windows with WSL2, and thus only giving half of your memory to the WSL virtual machines
- have a lot of CPU cores
This will most likely create a situation where the FP4 MoE kernel compilation triggers an OOM kill within the container. Here is a small guide on how to get it running:
- First you need to edit the `%USERPROFILE%/.wslconfig` file to reduce the CPU cores given to WSL (and so to the Docker containers you will run) and increase its RAM allocation. Reducing the number of available cores will reduce the number of parallel compilation jobs and therefore the RAM consumption. If you have 64GiB of RAM the following configuration will work (otherwise reduce it):
```
[wsl2]
processors=6
memory=50G
```
- Once the file has been saved, log out and log back in to start Docker Desktop with the new limits.
- Execute the following command in a PowerShell terminal:
```
docker run -ti --name Qwen3-Coder-30B-A3B-NVFP4 --gpus all -v 'E:\cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-Coder-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-Coder-30B-A3B --gpu-memory-utilization 0.8 --max-model-len 75K --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
  a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation), and a bunch of other folders you want to keep between runs.
  b. `gpu-memory-utilization` and `max-model-len` have been adjusted for the 32GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.
- Let vLLM cook. You can use the Docker Desktop Exec tab to check the compilation activity (and RAM usage!) with `htop`, for example: `apt update && apt install -y htop && htop`
- Once the service has successfully started, press `CTRL-C` to stop the container.
- Edit `%USERPROFILE%/.wslconfig` back to restore your original values, then log out and log back in to start fresh with these values.
- Open Docker Desktop and simply press the start button of the `Qwen3-Coder-30B-A3B-NVFP4` container. You can now manage it from the UI whenever you need it.
- Enjoy fast NVFP4 inference!
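Since the server is launched with `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder`, tool calling works through the standard OpenAI-compatible API. A quick check (the `get_weather` tool below is a made-up example, not something shipped with the model):

```python
# Illustration of tool calling through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical tool definition, only here to exercise the qwen3_coder tool-call parser.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather like in Paris?"}],
    tools=tools,
)

# With --tool-call-parser qwen3_coder the call comes back as structured tool_calls.
print(resp.choices[0].message.tool_calls)
```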