Nova 70B (NVFP4A16 quant)

This repo contains Nova-70B-Llama-3.3 quantized with NVFP4A16, a 4-bit weight-only compression designed for maximum performance on a wide range of hardware with 8-bit-like accuracy.

ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 keeps 16-bit activations, so it is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000 and RTX 4000 series GPUs.

Original Model:

This model requires ~40GiB of VRAM. Make sure to set an appropriate context size with --max-model-len in vLLM, and/or quantize the KV cache, and/or spread the model across multiple GPUs, for example with tensor parallelism (see the sketch below).
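
For illustration, here is a minimal sketch of these options with vllm serve; the flag values below are assumptions to adapt to your hardware, not tested settings:

# Single GPU: cap the context and quantize the KV cache to FP8
vllm serve mratsim/Nova-70B-NVFP4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8

# Two GPUs: shard the weights with tensor parallelism
vllm serve mratsim/Nova-70B-NVFP4A16 \
  --tensor-parallel-size 2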

NVFP4 writeups:

📥 Usage & Running Instructions

The model was tested with vLLM on 1x or 2x RTX Pro 6000. Here is a script suitable for such a configuration with a 131072-token context length.

Recommendations

It is however recommended to use only a 65K context to avoid significant quality degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with "min-p" sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API), but most LLM frontends only support modifying min-p when using Text Completions. You can however use --override-generation-config "${SAMPLER_OVERRIDE}" to override the sampler server-side (the result is a merge of generation_config.json and the vLLM defaults).
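
Alternatively, min-p can be passed per request: vLLM's OpenAI-compatible server accepts min_p as an extra sampling parameter. A minimal sketch against the server started by the script below (assuming the default port 8000 and the served model name nova-70b):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nova-70b",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "temperature": 0.8,
    "min_p": 0.05
  }'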

Running script

# Model configuration (Mandatory)
MODEL="mratsim/Nova-70B-NVFP4A16"
MODELNAME="nova-70b"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 0.8, "min_p": 0.05, "repetition_penalty": 1.05}'

# Prevent vLLM from using 100% CPU when idle (strongly recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
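
Once the server is up, a quick smoke test through the Chat Completions API (assuming the default port 8000) might look like this:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nova-70b",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64
  }'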

ℹ️ The FlashInfer backend may fail with an error similar to Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator.

A workaround is to run a sed replacement command inside the vLLM install to increase the workspace buffer size:

sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344

🔬 Quantization method

The llmcompressor library was used with the following recipe:

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4A16

NVFP4A16 doesn't require any calibration dataset.
