Mistral-Medium-3.5-128B — NVFP4 (W4A4-FP4), v3 calibration
NVFP4 (W4A4-FP4) quantization of mistralai/Mistral-Medium-3.5-128B,
produced with llm-compressor
and saved in the compressed-tensors nvfp4-pack-quantized format. The vision
tower, multi-modal projector, embeddings, and lm_head are kept at bf16 — the
language-model body (88 Ministral3DecoderLayer blocks) is the only thing in
FP4. Designed to be served by vLLM on Blackwell-class GPUs (SM 12.0+ —
RTX 50-series / B-series) using the FlashInfer NVFP4 GEMM kernel.
This release replaces the prior NVFP4 build in this repo. Two things changed:
- Calibration corpus expanded from 20 samples on a 4-source mix to 256 samples drawn from a 2560-sample mixed-format corpus that exercises Mistral chat-templated, Anthropic-XML, and OpenAI-tool-JSON surfaces.
- YARN config corrected. The bf16 source had rope_yarn_log_multiplier=1.0 upstream, which broke long-context generation. This build's config.json has the corrected 0.0 value, matching Mistral's config fix commit.
Compression summary
| Component | Tensors | Dtype | Approx size |
|---|---|---|---|
| language_model (88 decoder layers) | 2641 | NVFP4 (W4A4-FP4 + scales) | 63.81 GiB |
| vision_tower (Pixtral) | 434 | bf16 | 4.65 GiB |
| multi_modal_projector | 4 | bf16 | 348 MiB |
| embed_tokens | 1 | bf16 | 3.00 GiB |
| lm_head | 1 | bf16 | 3.00 GiB |
| Total | 3081 | mixed | 74.80 GiB |
233 GiB bf16 source → 74.8 GiB on disk. The text body alone is 200 GiB → 63.8 GiB ≈ 3.1× (close to the theoretical 4× for plain W4A4 once you account for per-16 group scales, per-tensor global scales, and the bf16 carve-outs above).
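The ratios above can be reproduced with a little arithmetic. The sketch below is illustrative only: it assumes the usual NVFP4 layout of 4-bit payloads plus one 8-bit (FP8) scale per group of 16 weights, ignores the per-tensor global scales as negligible, and takes the GiB figures straight from the table rather than measuring anything. The function names are made up for this example.

```python
# Back-of-the-envelope check of the compression figures quoted above.
# Sizes are the GiB values from the summary; nothing here is measured.

BF16_BITS = 16
FP4_BITS = 4
GROUP_SIZE = 16   # NVFP4 shares one scale across 16 weights
SCALE_BITS = 8    # assumption: each per-group scale is stored as FP8


def effective_bits_per_weight() -> float:
    """Payload bits plus the amortized per-group scale."""
    return FP4_BITS + SCALE_BITS / GROUP_SIZE


def ideal_ratio() -> float:
    """Best-case bf16 -> NVFP4 ratio for a fully quantized tensor."""
    return BF16_BITS / effective_bits_per_weight()


def observed_ratio(source_gib: float, quantized_gib: float) -> float:
    """Ratio actually achieved on disk."""
    return source_gib / quantized_gib


if __name__ == "__main__":
    print(f"effective bits/weight: {effective_bits_per_weight():.2f}")
    print(f"ideal ratio:           {ideal_ratio():.2f}x")
    print(f"text body:             {observed_ratio(200.0, 63.81):.2f}x")
    print(f"whole checkpoint:      {observed_ratio(233.0, 74.80):.2f}x")
```

Under these assumptions the group scales alone lower the ceiling from 4× to roughly 3.56×; the rest of the gap to the observed ~3.1× is not explained by this arithmetic.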
Quantization recipe (v3)
- Algorithm: RTN-style QuantizationModifier with scheme="NVFP4" (the canonical recipe from the llm-compressor NVFP4 example, adapted for Mistral3ForConditionalGeneration).
- Calibration corpus: 256 samples drawn from the v3 mixed-format corpus at calibration-v3.txt. Composition of the source corpus:
  - Mistral chat-templated (~50%): prompts wrapped in Mistral's official chat template.
  - Anthropic XML (~25%): the same prompts re-rendered as <role>...</role> XML.
  - OpenAI generic JSON (~25%): the same prompts in OpenAI tool-calling JSON shape.

  Underlying source mix (512 samples): Claude Opus reasoning traces, Nemotron math + tool-use, code-edit instructions. Plus a "wild-card languages" supplement of 150 samples across 15 typologically diverse Aya languages (amharic, armenian, burmese, georgian, tamil, thai, bengali, nepali, panjabi, basque, finnish, welsh, esperanto, yoruba, egyptian_arabic).
- Sequence length: 1024 tokens.
- Ignore list (kept at bf16): ignore = ["lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]
- Pipeline: pipeline="sequential" with sequential_targets=["Ministral3DecoderLayer"]: one decoder block onto a single GPU at a time, model otherwise on CPU/bf16. Peak VRAM ~20 GiB on a single 5090 during calibration.
End-to-end runtime on the hardware below: ~68 minutes (faster than the original 20-sample build because the per-Linear NVFP4 weight global-scale pre-pass dominates regardless of sample count).
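To give a feel for what "RTN-style" NVFP4 means for a single weight group, here is a deliberately simplified pure-Python sketch. It is not the llm-compressor implementation: it rounds each value to the nearest FP4 (E2M1) magnitude with one scale per group of 16, and it omits the FP8 encoding of that scale and the per-tensor global scale that real NVFP4 uses. All function names are hypothetical.

```python
# Simplified round-to-nearest (RTN) quantization onto the FP4 E2M1 grid,
# one scale per group of 16 values. Ties go to the smaller magnitude here,
# which may differ from hardware rounding.

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
FP4_MAX = 6.0
GROUP_SIZE = 16


def quantize_group(values):
    """Return (scale, codes); codes are signed values on the FP4 grid."""
    amax = max(abs(v) for v in values)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_group(scale, codes):
    """Map grid codes back to real values."""
    return [scale * c for c in codes]


def quantize_rtn(weights):
    """Quantize and dequantize a flat list, group by group."""
    if len(weights) % GROUP_SIZE != 0:
        raise ValueError("length must be a multiple of GROUP_SIZE")
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        scale, codes = quantize_group(weights[start:start + GROUP_SIZE])
        out.extend(dequantize_group(scale, codes))
    return out


if __name__ == "__main__":
    row = [0.01 * i - 0.08 for i in range(16)]
    approx = quantize_rtn(row)
    worst = max(abs(a - b) for a, b in zip(row, approx))
    print(f"max abs error in group: {worst:.4f}")
```

Because the scale is derived from the weights themselves, no calibration data is needed for this step; in the W4A4 scheme the calibration samples matter for the activation side.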
Reproducing the quantization
The v3 scripts used to build this checkpoint are bundled:
- prepare_calibration_v3.py: renders the v3 mixed-format corpus.
- quant_nvfp4_v3.py: the actual oneshot run.
pip install "transformers>=5.7.0" \
git+https://github.com/vllm-project/llm-compressor.git
HF_HOME=/path/to/cache CUDA_VISIBLE_DEVICES=0 python quant_nvfp4_v3.py
The calibration sources stream from the Hub on first run; ~512 rows per source are pulled. Total source-data download is a few hundred MB.
Hardware tested on
- GPUs: 4 × NVIDIA RTX 5090 (Blackwell, 32 GB each) at PCIe 5.0
- CPU: 64-thread x86-64
- RAM: 503 GB DDR5
- Storage: ZFS on NVMe
- OS: Linux 6.17.0
- vLLM image: vllm/vllm-openai:nightly (build with FlashInferCutlassNvFp4 linear kernel and TURBOQUANT attention backend)
vLLM startup loads ~28 GB / GPU at TP=4 with weights + KV cache + cudagraph buffers. Long-context inference (≥150k tokens) verified working with TurboQuant 4-bit KV cache.
Serving with vLLM
A known compatibility wart: recent vLLM nightlies require
preprocessor_config.json for Mistral3ForConditionalGeneration even at load
time. Mistral's upstream repo does not ship one (only processor_config.json),
so a stub preprocessor_config.json is included here, extracted from the
embedded image_processor block of processor_config.json. With this file
present, multimodal load is straightforward.
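If you need to regenerate the stub yourself (for example after re-downloading the upstream processor files), the extraction described above can be sketched as follows. This is a minimal sketch, assuming the image settings live under a top-level image_processor key in processor_config.json as described; the function name is hypothetical and the exact fields vLLM expects may change between nightlies.

```python
import json
from pathlib import Path


def write_preprocessor_stub(model_dir: str) -> Path:
    """Copy the image_processor block of processor_config.json into
    preprocessor_config.json, unless that file already exists."""
    root = Path(model_dir)
    target = root / "preprocessor_config.json"
    if target.exists():
        return target  # never overwrite a file that is already shipped

    processor_cfg = json.loads((root / "processor_config.json").read_text())
    image_cfg = processor_cfg.get("image_processor")
    if not isinstance(image_cfg, dict):
        raise KeyError("processor_config.json has no image_processor block")

    target.write_text(json.dumps(image_cfg, indent=2) + "\n")
    return target
```

Calling write_preprocessor_stub("/path/to/Mistral-Medium-3.5-128B-NVFP4-v3") is a no-op for this repo, since the stub is already included.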
Multimodal config (proven)
docker run --rm --gpus '"device=0,1,2,3"' \
--ipc host --ulimit memlock=-1 \
-p 11440:8000 \
-v /path/to/Mistral-Medium-3.5-128B-NVFP4-v3:/model:ro \
--tmpfs /dev/shm:size=32g \
vllm/vllm-openai:nightly \
vllm serve /model \
--served-model-name mistralai/Mistral-Medium-3.5-128B \
--port 8000 \
--tensor-parallel-size 4 \
--max-num-seqs 8 \
--max-num-batched-tokens 16 \
--max-model-len 200000 \
--gpu-memory-utilization 0.92 \
--limit-mm-per-prompt '{"image":4}' \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--kv-cache-dtype turboquant_4bit_nc \
--safetensors-load-strategy=prefetch
Text-only config (drop the --limit-mm-per-prompt flag)
The image processor still loads (because Mistral3ForConditionalGeneration is
the architecture in config.json), but no image inputs will be accepted at
request time. Useful when you only need the text endpoint.
Sampling that works well
- Conservative / factual: temperature=0.4 top_p=0.9 top_k=0 min_p=0
- Default chat: temperature=0.7 top_p=0.95 top_k=0 min_p=0
- Creative writing: temperature=0.85 top_p=0.97 top_k=0 min_p=0.02
Smoke-tested on essay generation, multilingual factual recall, constraint following (forbidden-word lists), narrative writing, and register switching. The v3 build is a noticeable quality jump over the 20-sample build for constraint adherence and vocabulary range.
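The sampling presets listed in this section can be turned into request payloads for an OpenAI-compatible client. The sketch below is one possible shape, not part of the release: the preset names and build_request are made up for illustration, and it assumes the server accepts top_k and min_p as extra body fields, which may depend on the vLLM version.

```python
# Sampling presets from this section, keyed by illustrative names.
PRESETS = {
    "conservative": {"temperature": 0.4, "top_p": 0.9, "top_k": 0, "min_p": 0.0},
    "default": {"temperature": 0.7, "top_p": 0.95, "top_k": 0, "min_p": 0.0},
    "creative": {"temperature": 0.85, "top_p": 0.97, "top_k": 0, "min_p": 0.02},
}

# Fields that are standard chat-completion parameters; the rest are
# assumed to be server-specific and are routed through extra_body.
OPENAI_FIELDS = ("temperature", "top_p")


def build_request(preset: str, messages: list, model: str) -> dict:
    """Build keyword arguments for a chat-completions call."""
    params = PRESETS[preset]
    request = {"model": model, "messages": messages}
    request.update({key: params[key] for key in OPENAI_FIELDS})
    request["extra_body"] = {
        key: value for key, value in params.items() if key not in OPENAI_FIELDS
    }
    return request


if __name__ == "__main__":
    demo = build_request(
        "creative",
        [{"role": "user", "content": "Write a two-line poem about tensors."}],
        "mistralai/Mistral-Medium-3.5-128B",
    )
    print(demo)
```

With the openai Python client pointed at the container above (base URL http://localhost:11440/v1), the result can be passed as client.chat.completions.create(**demo).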
Known limitations
- vLLM nightly version drift. The --limit-mm-per-prompt flag and image processor instantiation logic have changed several times in 2026. If you hit OSError: Can't load image processor, verify preprocessor_config.json is present alongside the model weights.
- Vision quality with text-only calibration. llm-compressor only quantizes the language-model body; the vision tower stays bf16, so vision quality is unaffected by NVFP4 quantization. However, the LM-side channels that were co-trained on vision-projected activations are slightly under-observed by text-only calibration. Pure-text inference is unaffected; mixed text+image inference may show small degradations vs the bf16 baseline.
Original Mistral Medium 3.5 128B README
The remainder of this README is reproduced from the upstream
mistralai/Mistral-Medium-3.5-128B
model card. License, capabilities, and intended-use guidance are owned by
Mistral.
Mistral Medium 3.5 128B
Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance for instruct, reasoning and coding tasks in a new unified model in comparison with our previous released models.
Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.
Find more information on our blog.
To speed up local inference using vLLM or SGLang, check out our released EAGLE model.
The Transformers config originally had an incorrect entry that caused long-context performance degradation. This has been fixed in this commit. GGUFs generated using the Transformers config prior to this commit are also affected. Please use the correct config for best performance.
Key Features
Mistral Medium 3.5 includes the following architectural choices:
- Dense 128B parameters.
- 256k context length.
- Multimodal input: Accepts both text and image input, with text output.
- Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).
Mistral Medium 3.5 offers the following capabilities:
- Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
- Vision: Analyzes images and provides insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
- System Prompt: Strong adherence and support for system prompts.
- Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
- Large Context Window: Supports a 256k context window.
We release this model under a Modified MIT License: Open-source license for both commercial and non-commercial use with exceptions for companies with large revenue.
Recommended Settings
- Reasoning Effort:
  - 'none' → Do not use reasoning
  - 'high' → Use reasoning (recommended for complex prompts and agentic usage)

  Use reasoning_effort="high" for complex tasks and agentic coding.
- Temperature: 0.7 for reasoning_effort="high". Between 0.0 and 0.7 for reasoning_effort="none", depending on the task. Generally, lower values give answers that are more to the point, and higher values allow the model to be more creative. It is good practice to try different values to tune the model's behavior to your needs.
- Top p: 0.95 for reasoning_effort="high". You can try different values, but staying close to this one should achieve the best performance. Leave it at None (or 1.0) for reasoning_effort="none".
Benchmarks
Agentic Benchmarks
Mistral Medium 3.5 supersedes all our previous coding models, namely Devstral, across all benchmarks. It scores 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified. Due to its stronger agentic capabilities, Mistral Medium 3.5 replaces Devstral 2 in our coding agent, Vibe CLI.
Instruction Following, Reasoning, and Coding Benchmarks
We compared Mistral Medium 3.5 with competing models on instruction following, reasoning (math), and coding benchmarks. Thanks to its unified capabilities, it achieves strong results across all these tasks and Mistral Medium 3.5 is now powering Le Chat.
Usage
You can find Mistral Medium 3.5 support on multiple libraries for inference and fine-tuning.
We thank all the contributors and maintainers who helped make this happen.
Mistral-Vibe
Use Mistral Medium 3.5 with Mistral Vibe.
Install
Install the latest version:
uv pip install mistral-vibe --upgrade
API Usage
Mistral Medium 3.5 can be selected by starting vibe. If it is the first time you launch vibe, it will:
- Create a default configuration file at ~/.vibe/config.toml.
- Prompt you to enter your API key if it's not already configured.
- Save your API key to ~/.vibe/.env for future use.
Now select mistral-medium-3.5 and start building !
Local server
If instead of pinging the Mistral API, you want to use a local vLLM server, you can do the following:
- Spin up a vLLM server as explained in Usage - vllm.
- Add the model configuration in ~/.vibe/config.toml:
display_name = "Mistral Medium 3.5 (local vLLM)"
description = "Mistral Medium 3.5 mode using local vLLM"
safety = "neutral"
active_model = "mistral-medium-3.5" # Make sure this is the only active_model entry
[[providers]]
name = "vllm"
api_base = "http://<your-host-url>:8000/v1"
api_key_env_var = ""
backend = "generic"
api_style = "reasoning"
[[models]]
name = "mistralai/Mistral-Medium-3.5-128B"
provider = "vllm"
alias = "mistral-medium-3.5"
thinking = "high"
temperature = 0.7
auto_compact_threshold = 168000
[tools.bash]
default_timeout = 1200
Notes:
- Make sure to overwrite <your-host-url> with your server's URL.
- Other inference backends are also supported. Please look at the Mistral Vibe repo for more info.
Then restart vibe and "tab-shift" to "mistral-medium-3.5" mode.
Give it a try on some coding agentic tasks and start building some cool stuff !
Inference
The model can be deployed with:
- vllm (recommended): See here.
- llama.cpp: See here for Unsloth's GGUFs.
- LM studio: WIP, stay tuned!
- Ollama: See here.
- SGLang: See here.
- transformers: See here.
For optimal performance, we recommend using the Mistral AI API if local serving is subpar.
Make sure that frameworks relying on the Transformers configuration, including GGUF files, are up to date with the fixes introduced in this commit. Otherwise, you will experience subpar performance, especially in long-context sessions.
Fine-Tuning
Fine-tune the model via:
vLLM (Recommended)
We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.
To speed up local inference using vLLM, check out our released EAGLE model
Installation
Make sure to install vllm nightly:
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
Doing so should automatically install mistral_common >= 1.11.1 and transformers >= 5.4.0.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"
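The two one-liners print the installed versions but leave the comparison to you. A small script can do the comparison as well; this is a rough sketch that only looks at the leading numeric release components (so a dev or rc build is treated like its release), and the helper names are invented for this example.

```python
from importlib import metadata

# Minimum versions quoted in this section.
MINIMUMS = {"mistral_common": "1.11.1", "transformers": "5.4.0"}


def release_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '5.4.0.dev0' -> (5, 4, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples after padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for package, ok in check_environment().items():
        print(f"{package}: {'missing' if ok is None else 'ok' if ok else 'too old'}")
```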
You can also make use of a ready-to-go docker image or on the docker hub.
Serve the Model
We recommend a server/client setup:
vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
--tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
--gpu_memory_utilization 0.8
Ping the Server
Instruction Following
Mistral Medium 3.5 can follow your instructions to the letter.
from datetime import datetime, timedelta
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
REASONING_EFFORT = "none" # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
reasoning_effort=REASONING_EFFORT,
temperature=TEMP,
top_p=TOP_P,
)
print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Tool Call
Let's solve some equations thanks to our simple Python calculator tool.
import json
from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
REASONING_EFFORT = "none" # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"
def my_calculator(expression: str) -> str:
    return str(eval(expression))
tools = [
{
"type": "function",
"function": {
"name": "my_calculator",
"description": "A calculator that can evaluate a mathematical expression.",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "The mathematical expression to evaluate.",
},
},
"required": ["expression"],
},
},
},
{
"type": "function",
"function": {
"name": "rewrite",
"description": "Rewrite a given text for improved clarity",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The input text to rewrite",
}
},
},
},
},
]
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
},
{
"type": "image_url",
"image_url": {
"url": image_url,
},
},
],
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
tool_choice="auto",
reasoning_effort=REASONING_EFFORT,
temperature=TEMP,
top_p=TOP_P,
)
tool_calls = response.choices[0].message.tool_calls
results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)
messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )
response = client.chat.completions.create(
model=model,
messages=messages,
reasoning_effort=REASONING_EFFORT,
temperature=TEMP,
top_p=TOP_P,
)
print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Vision Reasoning
Let's see if Mistral Medium 3.5 knows when to pick a fight!
from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
REASONING_EFFORT = "high" # Remove reasoning with 'none'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
reasoning_effort=REASONING_EFFORT,
temperature=TEMP,
top_p=TOP_P,
)
print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
SGLang
Serve Mistral Medium 3.5 with the SGLang library for production-ready inference.
To speed up local inference using SGLang, check out our released EAGLE model.
Installation
Day-zero support ships in dedicated docker tags:
docker pull lmsysorg/sglang:dev-mistral-medium-3.5 # H100 / H200 (Hopper, CUDA 12.9)
docker pull lmsysorg/sglang:dev-cu13-mistral-medium-3.5 # B200 / B300 (Blackwell, CUDA 13.0)
Or follow the SGLang installation guide. Requires transformers >= 5.4.0.
Serve the Model
python -m sglang.launch_server --model-path mistralai/Mistral-Medium-3.5-128B \
--tp 8 --tool-call-parser mistral --reasoning-parser mistral
For the full deployment guide, benchmarks, and per-request examples (reasoning effort, tool calls, vision, streaming), see the SGLang cookbook entry for Mistral Medium 3.5.
Transformers
Installation
First install the Transformers framework to use Mistral Medium 3.5:
uv pip install transformers
Inference
Python Inference Snippet
import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
REASONING_EFFORT = "high" # Remove reasoning with 'none'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = 1.0
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")
model_id = "mistralai/Mistral-Medium-3.5-128B"
processor = AutoProcessor.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
model_id, device_map="auto"
)
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
inputs = processor.apply_chat_template(messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort=REASONING_EFFORT)
inputs = inputs.to(model.device)
output = model.generate(
**inputs,
max_new_tokens=1024,
do_sample=True,
temperature=TEMP,
top_p=TOP_P,
)[0]
# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=False)
print(decoded_output)
License
This model is licensed under a Modified MIT License.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.