Mistral-Medium-3.5-128B — NVFP4 (W4A4-FP4), v3 calibration

NVFP4 (W4A4-FP4) quantization of mistralai/Mistral-Medium-3.5-128B, produced with llm-compressor and saved in the compressed-tensors nvfp4-pack-quantized format. The vision tower, multi-modal projector, embeddings, and lm_head are kept at bf16 — the language-model body (88 Ministral3DecoderLayer blocks) is the only thing in FP4. Designed to be served by vLLM on Blackwell-class GPUs (SM 12.0+ — RTX 50-series / B-series) using the FlashInfer NVFP4 GEMM kernel.

This release replaces the prior NVFP4 build in this repo. Two things changed:

1. Calibration corpus expanded from 20 samples on a 4-source mix to 256 samples drawn from a 2560-sample mixed-format corpus that exercises Mistral chat-templated, Anthropic-XML, and OpenAI-tool-JSON surfaces.

2. YARN config corrected. The bf16 source had rope_yarn_log_multiplier=1.0 upstream, which broke long-context generation. This build's config.json has the corrected 0.0 value, matching Mistral's config fix commit.

Compression summary

Component	Tensors	Dtype	Approx size
`language_model` (88 decoder layers)	2641	NVFP4 (W4A4-FP4 + scales)	63.81 GiB
`vision_tower` (Pixtral)	434	bf16	4.65 GiB
`multi_modal_projector`	4	bf16	348 MiB
`embed_tokens`	1	bf16	3.00 GiB
`lm_head`	1	bf16	3.00 GiB
Total	3081	mixed	74.80 GiB

233 GiB bf16 source → 74.8 GiB on disk. The text body alone is 200 GiB → 63.8 GiB ≈ 3.1× (close to the theoretical 4× for plain W4A4 once you account for per-16 group scales, per-tensor global scales, and the bf16 carve-outs above).
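As a rough check on the ratio above, the storage cost per weight can be worked out directly. The sketch below assumes the standard NVFP4 layout (4-bit values with one 8-bit FP8 scale shared by each group of 16 weights) and ignores the per-tensor global scales, which are negligible in size. The function name is illustrative.

```python
def nvfp4_bits_per_weight(value_bits: int = 4, scale_bits: int = 8, group_size: int = 16) -> float:
    """Effective bits per weight: the packed value plus its share of the group scale."""
    return value_bits + scale_bits / group_size


bits = nvfp4_bits_per_weight()
ideal_ratio = 16 / bits        # bf16 stores 16 bits per weight
observed_ratio = 200 / 63.81   # text-body sizes quoted above, in GiB

print(f"effective bits/weight: {bits}")
print(f"ideal ratio vs bf16:   {ideal_ratio:.2f}x")
print(f"observed ratio:        {observed_ratio:.2f}x")
```

Under these assumptions the group scales alone lower the ceiling from 4× to about 3.56×, before any other overhead is counted.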

Quantization recipe (v3)

Algorithm: RTN-style QuantizationModifier with scheme="NVFP4" (the canonical recipe from the llm-compressor NVFP4 example, adapted for Mistral3ForConditionalGeneration).
Calibration corpus: 256 samples drawn from the v3 mixed-format corpus at calibration-v3.txt. Composition of the source corpus:
- Mistral chat-templated (~50%): prompts wrapped in Mistral's official chat template.
- Anthropic XML (~25%): the same prompts re-rendered as <role>...</role> XML.
- OpenAI generic JSON (~25%): the same prompts in OpenAI tool-calling JSON shape.
Underlying source mix (512 samples): Claude Opus reasoning traces, Nemotron math + tool-use, code-edit instructions. Plus a "wild-card languages" supplement of 150 samples across 15 typologically diverse Aya languages (amharic, armenian, burmese, georgian, tamil, thai, bengali, nepali, panjabi, basque, finnish, welsh, esperanto, yoruba, egyptian_arabic).
Sequence length: 1024 tokens.

Ignore list (kept at bf16):

ignore = ["lm_head",
          "re:.*vision_tower.*",
          "re:.*multi_modal_projector.*"]

Pipeline: pipeline="sequential" with sequential_targets=["Ministral3DecoderLayer"], which moves one decoder block onto a single GPU at a time while the rest of the model stays on CPU in bf16. Peak VRAM was ~20 GiB on a single 5090 during calibration.

End-to-end runtime on the hardware below: ~68 minutes. That is faster than the original 20-sample build; the larger sample count costs little because the per-Linear NVFP4 weight global-scale pre-pass dominates regardless of sample count.
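To see what the ignore list above excludes, the patterns can be tried against module names. This is a minimal sketch that approximates the matching rule: entries prefixed with `re:` are treated as regular expressions and other entries as exact names. The module names are hypothetical examples, not read from the checkpoint, and the real matching inside compressed-tensors may differ in detail.

```python
import re

IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Return True if module_name matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.vision_tower.transformer.layers.0.attention.q_proj",
    "model.multi_modal_projector.linear_1",
    "model.language_model.layers.0.self_attn.q_proj",
]:
    print(f"{name}: {'bf16 (ignored)' if is_ignored(name) else 'NVFP4'}")
```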

Reproducing the quantization

The v3 scripts used to build this checkpoint are bundled:

prepare_calibration_v3.py — renders the v3 mixed-format corpus.
quant_nvfp4_v3.py — the actual oneshot run.

pip install "transformers>=5.7.0" \
            git+https://github.com/vllm-project/llm-compressor.git
HF_HOME=/path/to/cache CUDA_VISIBLE_DEVICES=0 python quant_nvfp4_v3.py

The calibration sources stream from the Hub on first run; ~512 rows per source are pulled. Total source-data download is a few hundred MB.

Hardware tested on

GPUs: 4 × NVIDIA RTX 5090 (Blackwell, 32 GB each) at PCIe 5.0
CPU: 64-thread x86-64
RAM: 503 GB DDR5
Storage: ZFS on NVMe
OS: Linux 6.17.0
vLLM image: vllm/vllm-openai:nightly (build with FlashInferCutlassNvFp4 linear kernel and TURBOQUANT attention backend)

vLLM startup loads ~28 GB / GPU at TP=4 with weights + KV cache + cudagraph buffers. Long-context inference (≥150k tokens) verified working with TurboQuant 4-bit KV cache.

Serving with vLLM

A known compatibility wart: recent vLLM nightlies require preprocessor_config.json for Mistral3ForConditionalGeneration even at load time. Mistral's upstream repo does not ship one (only processor_config.json), so a stub preprocessor_config.json is included here, extracted from the embedded image_processor block of processor_config.json. With this file present, multimodal load is straightforward.
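The extraction described above can be sketched in a few lines. This assumes processor_config.json holds the image-processor settings under a top-level `image_processor` key, as the paragraph describes. The function name and path are illustrative, and this is not necessarily the script used to produce the bundled file.

```python
import json
from pathlib import Path


def write_preprocessor_stub(model_dir: str) -> Path:
    """Copy the embedded image_processor block into preprocessor_config.json."""
    root = Path(model_dir)
    processor_cfg = json.loads((root / "processor_config.json").read_text())
    image_cfg = processor_cfg["image_processor"]  # assumed key, per the text above
    out_path = root / "preprocessor_config.json"
    out_path.write_text(json.dumps(image_cfg, indent=2) + "\n")
    return out_path
```

Calling `write_preprocessor_stub("/path/to/model")` on a directory that already contains processor_config.json writes the stub next to it and returns its path.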

Multimodal config (proven)

docker run --rm --gpus '"device=0,1,2,3"' \
  --ipc host --ulimit memlock=-1 \
  -p 11440:8000 \
  -v /path/to/Mistral-Medium-3.5-128B-NVFP4-v3:/model:ro \
  --tmpfs /dev/shm:size=32g \
  vllm/vllm-openai:nightly \
  vllm serve /model \
    --served-model-name mistralai/Mistral-Medium-3.5-128B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-num-seqs 8 \
    --max-num-batched-tokens 16 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.92 \
    --limit-mm-per-prompt '{"image":4}' \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --kv-cache-dtype turboquant_4bit_nc \
    --safetensors-load-strategy=prefetch

Text-only config (drop the `--limit-mm-per-prompt` flag)

The image processor still loads (because Mistral3ForConditionalGeneration is the architecture in config.json), but no image inputs will be accepted at request time. Useful when you only need the text endpoint.

Sampling that works well

Conservative / factual: temperature=0.4 top_p=0.9 top_k=0 min_p=0
Default chat: temperature=0.7 top_p=0.95 top_k=0 min_p=0
Creative writing: temperature=0.85 top_p=0.97 top_k=0 min_p=0.02
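These presets can be sent through the OpenAI-compatible endpoint. The sketch below only builds the request arguments: `top_k` and `min_p` are not part of the OpenAI schema, so they are placed in `extra_body`, which vLLM reads as additional sampling parameters. The preset names and the helper are illustrative.

```python
PRESETS = {
    "factual":  {"temperature": 0.4,  "top_p": 0.9,  "top_k": 0, "min_p": 0.0},
    "chat":     {"temperature": 0.7,  "top_p": 0.95, "top_k": 0, "min_p": 0.0},
    "creative": {"temperature": 0.85, "top_p": 0.97, "top_k": 0, "min_p": 0.02},
}


def sampling_kwargs(preset: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    p = PRESETS[preset]
    return {
        "temperature": p["temperature"],
        "top_p": p["top_p"],
        # Non-OpenAI sampling parameters travel in extra_body.
        "extra_body": {"top_k": p["top_k"], "min_p": p["min_p"]},
    }


print(sampling_kwargs("creative"))
```

With an `openai` client pointed at the server (for the Docker command above, http://localhost:11440/v1), the result can be unpacked into a request as `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("chat"))`.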

Smoke-tested on essay generation, multilingual factual recall, constraint following (forbidden-word lists), narrative writing, and register switching. The v3 build is a noticeable quality jump over the 20-sample build for constraint adherence and vocabulary range.

Known limitations

vLLM nightly version drift. The --limit-mm-per-prompt flag and image processor instantiation logic have changed several times in 2026. If you hit OSError: Can't load image processor, verify preprocessor_config.json is present alongside the model weights.
Vision quality with text-only calibration. llm-compressor only quantizes the language-model body; the vision tower stays bf16, so vision quality is unaffected by NVFP4 quantization. However, the LM-side channels that were co-trained on vision-projected activations are slightly under-observed by text-only calibration. Pure-text inference is unaffected; mixed text+image inference may show small degradations vs the bf16 baseline.

Original Mistral Medium 3.5 128B README

The remainder of this README is reproduced from the upstream mistralai/Mistral-Medium-3.5-128B model card. License, capabilities, and intended-use guidance are owned by Mistral.

Mistral Medium 3.5 128B

Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance on instruct, reasoning, and coding tasks from a single unified model compared with our previously released models.

Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.

Find more information on our blog.

To speed up local inference using vLLM or SGLang, check out our released EAGLE model.

The Transformers config originally had an incorrect entry that caused long-context performance degradation. This has been fixed in this commit. GGUFs generated using the Transformers config prior to this commit are also affected. Please use the correct config for best performance.

Key Features

Mistral Medium 3.5 includes the following architectural choices:

Dense 128B parameters.
256k context length.
Multimodal input: Accepts both text and image input, with text output.
Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).

Mistral Medium 3.5 offers the following capabilities:

Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
Vision: Analyzes images and provides insights based on visual content, in addition to text.
Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
System Prompt: Strong adherence and support for system prompts.
Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
Large Context Window: Supports a 256k context window.

We release this model under a Modified MIT License: Open-source license for both commercial and non-commercial use with exceptions for companies with large revenue.

Recommended Settings

Reasoning Effort:
- 'none' → Do not use reasoning
- 'high' → Use reasoning (recommended for complex prompts and agentic usage). Use reasoning_effort="high" for complex tasks and agentic coding.
Temperature: 0.7 for reasoning_effort="high". Between 0.0 and 0.7 for reasoning_effort="none", depending on the task. Generally, lower values give answers that are more to the point, and higher values allow the model to be more creative. It is good practice to try different values to tune the model's behavior to your needs.
Top p: 0.95 for reasoning_effort="high". You can try different values, but staying close to this one should give the best performance. Leave it at None (or 1.0) for reasoning_effort="none".

Benchmarks

Agentic Benchmarks

Mistral Medium 3.5 supersedes all our previous coding models, namely Devstral, across all benchmarks. It scores 91.4% on τ³-Telecom and 77.6% on SWE-Bench Verified. Due to its stronger agentic capabilities, Mistral Medium 3.5 replaces Devstral 2 in our coding agent, Vibe CLI.

Instruction Following, Reasoning, and Coding Benchmarks

We compared Mistral Medium 3.5 with competing models on instruction following, reasoning (math), and coding benchmarks. Thanks to its unified capabilities, it achieves strong results across all these tasks and Mistral Medium 3.5 is now powering Le Chat.

Usage

You can find Mistral Medium 3.5 support on multiple libraries for inference and fine-tuning.

We thank all the contributors and maintainers who helped make this happen.

Mistral-Vibe

Use Mistral Medium 3.5 with Mistral Vibe.

Install

Install the latest version:

uv pip install mistral-vibe --upgrade

API Usage

Mistral Medium 3.5 can be selected by starting vibe. If it is the first time you launch vibe, it will:

Create a default configuration file at ~/.vibe/config.toml.
Prompt you to enter your API key if it's not already configured.
Save your API key to ~/.vibe/.env for future use.

Now select mistral-medium-3.5 and start building!

Local server

If you want to use a local vLLM server instead of calling the Mistral API, do the following:

1. Spin up a vLLM server as explained in Usage - vllm
2. Add the model configuration in ~/.vibe/config.toml:

display_name = "Mistral Medium 3.5 (local vLLM)"
description = "Mistral Medium 3.5 mode using local vLLM"
safety = "neutral"

active_model = "mistral-medium-3.5" # Make sure this is the only active_model entry
[[providers]]
name = "vllm"
api_base = "http://<your-host-url>:8000/v1"
api_key_env_var = ""
backend = "generic"
api_style = "reasoning"

[[models]]
name = "mistralai/Mistral-Medium-3.5-128B"
provider = "vllm"
alias = "mistral-medium-3.5"
thinking = "high"
temperature = 0.7
auto_compact_threshold = 168000

[tools.bash]
default_timeout = 1200

Notes:

Make sure to replace <your-host-url> with your server's URL.
Other inference backends are also supported. See the Mistral Vibe repo for more info.

Then restart vibe and "tab-shift" to "mistral-medium-3.5" mode.

Give it a try on some agentic coding tasks and start building some cool stuff!

Inference

The model can be deployed with:

vllm (recommended): See here.
llama.cpp: See here for Unsloth's GGUFs.
LM Studio: WIP, stay tuned!
Ollama: See here.
SGLang: See here.
transformers: See here.

For optimal performance, we recommend using the Mistral AI API if local serving is subpar.

Make sure that frameworks relying on the Transformers configuration, including GGUF files, are up to date with the fixes introduced in this commit. Otherwise, you will experience subpar performance, especially in long-context sessions.

Fine-Tuning

Fine-tune the model via:

Axolotl: See here.
Unsloth: See here.

vLLM (Recommended)

We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.

To speed up local inference using vLLM, check out our released EAGLE model.

Installation

Make sure to install vllm nightly:

uv pip install -U vllm \
   --torch-backend=auto \
   --extra-index-url https://wheels.vllm.ai/nightly

Doing so should automatically install mistral_common >= 1.11.1 and transformers >= 5.4.0.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"
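The same check can be scripted so that an outdated package is reported explicitly. This is a minimal sketch using only the standard library. The minimums are the ones stated above, and the simple parser assumes plain numeric release versions: a suffix such as `.dev0` or `rc1` is ignored.

```python
import re
from importlib import metadata

MINIMUMS = {"mistral_common": "1.11.1", "transformers": "5.4.0"}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '5.4.0' or '5.4.0.dev0' into (5, 4, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed() -> None:
    """Print the status of each required package in the current environment."""
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(f"{package}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
        print(f"{package} {installed}: {status}")


if __name__ == "__main__":
    check_installed()
```

Comparing integer tuples rather than strings avoids the usual pitfall where "1.9.0" sorts after "1.11.1".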

You can also use a ready-to-go Docker image, which is also available on Docker Hub.

Serve the Model

We recommend a server/client setup:

vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
  --gpu_memory_utilization 0.8

Ping the Server

Instruction Following

Mistral Medium 3.5 can follow your instructions to the letter.

from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)

Tool Call

Let's solve some equations thanks to our simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # eval runs arbitrary Python: acceptable for this demo, never for untrusted input.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls or []:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        # Store each call with its result so the two stay paired even if
        # the model also called a tool that is not handled here.
        results.append((tool_call, result))

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in results:
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
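The my_calculator tool in the example above passes model-generated text to eval, which is fine for a demo but unsafe for untrusted input. As a hedged alternative, the sketch below evaluates only basic arithmetic by walking the expression's syntax tree. `safe_calculator` is an illustrative name, not part of the upstream example.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    """Recursively evaluate a numeric expression node."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals; raise ValueError on anything else."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))


print(safe_calculator("6 + 2 * 3"))
```

It takes and returns strings like my_calculator, so it can be swapped in for the tool function. Function calls, names, and attribute access all raise ValueError.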

Vision Reasoning

Let's see if Mistral Medium 3.5 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "high" # Remove reasoning with 'none'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)

SGLang

Serve Mistral Medium 3.5 with the SGLang library for production-ready inference.

To speed up local inference using SGLang, check out our released EAGLE model.

Installation

Day-zero support ships in dedicated docker tags:

docker pull lmsysorg/sglang:dev-mistral-medium-3.5         # H100 / H200 (Hopper, CUDA 12.9)
docker pull lmsysorg/sglang:dev-cu13-mistral-medium-3.5    # B200 / B300 (Blackwell, CUDA 13.0)

Or follow the SGLang installation guide. Requires transformers >= 5.4.0.

Serve the Model

python -m sglang.launch_server --model-path mistralai/Mistral-Medium-3.5-128B \
  --tp 8 --tool-call-parser mistral --reasoning-parser mistral

For the full deployment guide, benchmarks, and per-request examples (reasoning effort, tool calls, vision, streaming), see the SGLang cookbook entry for Mistral Medium 3.5.

Transformers

Installation

First install the Transformers framework to use Mistral Medium 3.5:

uv pip install transformers

Inference

Python Inference Snippet

import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration


REASONING_EFFORT = "high" # Remove reasoning with 'none'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = 1.0
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] are supported.")

model_id = "mistralai/Mistral-Medium-3.5-128B"

processor = AutoProcessor.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


inputs = processor.apply_chat_template(messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort=REASONING_EFFORT)
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=TEMP,
    top_p=TOP_P,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=False) 
print(decoded_output)
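Since the decoded text keeps the reasoning trace between [THINK] and [/THINK], it can be separated from the final answer with a small helper. This is a minimal sketch, assuming at most one such pair of tags appears in the output. `split_reasoning` is an illustrative name.

```python
def split_reasoning(
    decoded: str, open_tag: str = "[THINK]", close_tag: str = "[/THINK]"
) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no complete tag pair is found."""
    start = decoded.find(open_tag)
    end = decoded.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", decoded.strip()
    reasoning = decoded[start + len(open_tag):end].strip()
    # Everything outside the tag pair is treated as the answer.
    answer = (decoded[:start] + decoded[end + len(close_tag):]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("[THINK]Compare the options.[/THINK]Use the first move.")
print(reasoning)
print(answer)
```

Because the snippet above decodes with skip_special_tokens=False, the answer part may still contain other special tokens, such as an end-of-sequence marker, which this helper does not strip.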

License

This model is licensed under a Modified MIT License.

You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
