Qwen3-Coder-30B-A3B-Instruct-NVFP4
NVFP4 quantization using llm-compressor v0.8.1, based on the official NVFP4 example script for Qwen3-30B-A3B.
Dataset adjustments
Because this is the Coder variant, the dataset used for the quantisation calibration has been adjusted as follows:
- Instead of the original 20 entries from the regular ultrachat dataset used in the Qwen3-30B-A3B example, we used 512 samples:
  - 336 (~66%) come from the codesearchnet dataset for code-related calibration:
    - 56 randomly selected python entries
    - 56 randomly selected javascript entries
    - 56 randomly selected java entries
    - 56 randomly selected go entries
    - 56 randomly selected php entries
    - 56 randomly selected ruby entries
  - 176 come from the ultrachat dataset for instruction-related calibration
- `MAX_SEQUENCE_LENGTH` has been increased from `2048` to `4096`.
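For reference, here is a rough sketch of what this calibration mix and the NVFP4 oneshot call could look like with llm-compressor. This is not the exact script used for this checkpoint: the dataset IDs, column names (`func_code_string`, `messages`) and the recipe's `ignore` list below are assumptions based on the official example, so adjust them to your setup.

```python
# Hedged sketch of the calibration mix + NVFP4 oneshot call.
# Dataset IDs, column names and the ignore list are assumptions, not the exact script.
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
MAX_SEQUENCE_LENGTH = 4096      # raised from the 2048 of the original example
SAMPLES_PER_LANGUAGE = 56       # 6 languages x 56 = 336 code samples (~66%)
ULTRACHAT_SAMPLES = 176         # 336 + 176 = 512 calibration samples

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Code portion: 56 random functions per language from a codesearchnet mirror
# (dataset ID / loading details are assumptions, swap in the source you actually use).
code_parts = []
for lang in ["python", "javascript", "java", "go", "php", "ruby"]:
    ds = load_dataset("code_search_net", lang, split="train", trust_remote_code=True)
    ds = ds.shuffle(seed=42).select(range(SAMPLES_PER_LANGUAGE))
    # Wrap each function in a single-turn chat so both portions share one format.
    ds = ds.map(
        lambda x: {"messages": [{"role": "user", "content": x["func_code_string"]}]},
        remove_columns=ds.column_names,
    )
    code_parts.append(ds)

# Instruction portion: 176 ultrachat conversations.
chat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
chat = chat.shuffle(seed=42).select(range(ULTRACHAT_SAMPLES))
chat = chat.map(lambda x: {"messages": x["messages"]}, remove_columns=chat.column_names)

calibration = concatenate_datasets(code_parts + [chat]).shuffle(seed=42)

# Render with the chat template and tokenize, as in the official example.
calibration = calibration.map(
    lambda x: {"text": tokenizer.apply_chat_template(x["messages"], tokenize=False)},
    remove_columns=calibration.column_names,
)
calibration = calibration.map(
    lambda x: tokenizer(x["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True,
                        padding=False, add_special_tokens=False),
    remove_columns=calibration.column_names,
)

# NVFP4 on the Linear layers; lm_head and the MoE router gates stay unquantized
# (assumed ignore list, following the official MoE example).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(
    model=model,
    dataset=calibration,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen3-Coder-30B-A3B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-Coder-30B-A3B-Instruct-NVFP4")
```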
vLLM execution
Because this is an NVFP4 MoE model, you might have some trouble running it with the current vLLM version (v0.11.0), as no kernel is available. To launch it you will need to compile the CUTLASS FP4 GEMM (MoE) kernel for SM100 (RTX Pro 6000) or SM120 (RTX 5090). vLLM can do it automatically for you with the following configuration:
```
docker run -ti --name Qwen3-Coder-30B-A3B-NVFP4 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-Coder-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-Coder-30B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
The important part here is the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable, which instructs vLLM to compile the FP4 MoE kernel for your GPU architecture. Note that the more CPU cores you have, the more RAM you will need for the CUDA compilation.
For now you need the `vllm/vllm-openai:nightly` image (currently targeting 0.11.1rc4.dev6+g66a168a19), but once v0.11.1 is out that should no longer be necessary.
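Once the server is up, you can sanity-check the endpoint with the standard OpenAI-compatible client; the model name matches the `--served-model-name` used above:

```python
# Quick sanity check against the OpenAI-compatible endpoint exposed on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```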
A note for 5090 owners
While it is possible for you to run the model, there is a high chance that you:
- are running Windows with WSL2, and thus only giving half of your memory to the WSL virtual machines
- have a lot of CPU cores
This will most likely create a situation where the FP4 MoE kernel compilation triggers an OOM kill within the container. Here is a small guide on how to get it running:
- First you need to edit the `%USERPROFILE%/.wslconfig` file to reduce the CPU cores given to WSL (and so to the Docker containers you will run) and increase its RAM allocation. Reducing the number of available cores will reduce the number of parallel compilation jobs and therefore the RAM consumption. If you have 64GiB of RAM the following configuration will work (otherwise reduce it):
```
[wsl2]
processors=6
memory=50G
```
- Once the file has been saved, log out and log back in to start Docker Desktop with the new limits.
- Execute the following command in a PowerShell terminal:
```
docker run -ti --name Qwen3-Coder-30B-A3B-NVFP4 --gpus all -v 'E:\cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-Coder-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-Coder-30B-A3B --gpu-memory-utilization 0.8 --max-model-len 75K --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
  a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation), and a bunch of other folders you want to keep between runs.
  b. `gpu-memory-utilization` and `max-model-len` have been adjusted for the 32GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.
- Let vLLM cook. You can use the Docker Desktop Exec tab to check the compilation activity (and RAM usage!) with `htop`, for example: `apt update && apt install -y htop && htop`
- Once the service has successfully started, press `CTRL-C` to stop the container.
- Edit `%USERPROFILE%/.wslconfig` back to restore your original values, then log out and log back in to start fresh with these values.
- Open Docker Desktop and simply press the start button of the `Qwen3-Coder-30B-A3B-NVFP4` container. You can now manage it from the UI whenever you need it.
- Enjoy fast NVFP4 inference!
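Since the server is launched with `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder`, tool calling works through the standard OpenAI-compatible API. A quick check (the `get_weather` tool below is a made-up example, not something shipped with the model):

```python
# Illustration of tool calling through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical tool definition, only here to exercise the qwen3_coder tool-call parser.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather like in Paris?"}],
    tools=tools,
)

# With --tool-call-parser qwen3_coder the call comes back as structured tool_calls.
print(resp.choices[0].message.tool_calls)
```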