Model Overview

Model Architecture: DeepSeek-R1
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI350/MI355
ROCm: 7.0
PyTorch: 2.8.0
Transformers: 4.53.0
Operating System(s): Linux
Inference Engine: SGLang
Model Optimizer: AMD-Quark (V0.10)
- Weight quantization: OCP MXFP4, Static
- Activation quantization: OCP MXFP4, Dynamic
Calibration Dataset: Pile

This model was built from the deepseek-ai/DeepSeek-R1 model by applying AMD-Quark for MXFP4 quantization.

Model Quantization

The model was quantized from deepseek-ai/DeepSeek-R1 using AMD-Quark. Both weights and activations were quantized to MXFP4 format, and the AutoSmoothQuant algorithm was applied to enhance accuracy.
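To make the MXFP4 format more concrete, the following is a minimal sketch of quantizing one 32-element block (the same size as `--group_size 32`) to FP4 E2M1 values with a shared power-of-two scale, then dequantizing it. It is an illustration only and is not AMD-Quark's implementation: the function name `mxfp4_fake_quantize` is hypothetical, the tie-breaking in the rounding step is simplistic, and AutoSmoothQuant is not modeled.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 magnitude (6 = 1.5 * 2**2)
BLOCK_SIZE = 32  # elements that share one scale


def mxfp4_fake_quantize(block):
    """Quantize one block and return (shared_exponent, dequantized_values).

    Hypothetical helper for illustration; not part of AMD-Quark.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale that aligns the block maximum with the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        magnitude = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6
        # Nearest representable magnitude; ties go to the smaller value here.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - magnitude))
        dequantized.append(math.copysign(nearest * scale, v))
    return shared_exp, dequantized


block = [1.0, -3.0, 0.3, 6.0] + [0.0] * (BLOCK_SIZE - 4)
shared_exp, values = mxfp4_fake_quantize(block)
print(shared_exp, values[:4])
```

Because every element in a block shares one scale, small values that sit next to a large outlier lose precision (0.3 becomes 0.5 above), which is the kind of error that smoothing algorithms aim to reduce.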

Preprocessing requirement:

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at unsloth/DeepSeek-R1-BF16.

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*self_attn* *mlp.gate.* *lm_head"
python3 quantize_quark.py --model_dir $MODEL_DIR \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --num_calib_data 128 \
                          --exclude_layers $exclude_layers \
                          --skip_evaluation \
                          --multi_gpu \
                          --quant_algo autosmoothquant \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-MXFP4-ASQ

Deployment

Use with SGLang

This model can be deployed efficiently using the SGLang backend.
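Once a server is running, it can be queried over its OpenAI-compatible completions endpoint. The sketch below assumes the server was started with `sglang.launch_server` and listens at `http://localhost:30000/v1/completions`, the same URL used by the evaluation commands in this card. The helper names `build_completion_request` and `complete` are hypothetical, and the sampling defaults simply mirror the evaluation settings.

```python
import json
import urllib.request

# Assumed endpoint: the same base_url used by the evaluation commands.
BASE_URL = "http://localhost:30000/v1/completions"
MODEL_NAME = "amd/DeepSeek-R1-MXFP4-ASQ"


def build_completion_request(prompt, max_tokens=256, temperature=0.6, top_p=0.95):
    """Build a POST request for the completions endpoint (hypothetical helper)."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text; needs a running server."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("What is 17 * 23?", max_tokens=512)` would return the generated text as a string.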

Evaluation

The model was evaluated on reasoning tasks, including AIME24, MMLU_COT, and GSM8K, using a forked lm-evaluation-harness.

Accuracy

Benchmark	DeepSeek-R1	DeepSeek-R1-MXFP4-ASQ (this model)	Recovery
AIME24	78.0	76.0	97.44%
MMLU_COT	79.90	79.65	99.69%
GSM8K	95.81	95.42	99.59%
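The Recovery column is consistent with the ratio of the quantized score to the baseline score, expressed as a percentage. The short sketch below recomputes it from the table values under that assumption; the function name `recovery` is illustrative.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score, to two decimals."""
    return round(quantized / baseline * 100, 2)


# (baseline DeepSeek-R1 score, DeepSeek-R1-MXFP4-ASQ score) from the table above
scores = {
    "AIME24": (78.0, 76.0),
    "MMLU_COT": (79.90, 79.65),
    "GSM8K": (95.81, 95.42),
}

for task, (baseline, quantized) in scores.items():
    print(f"{task}: {recovery(baseline, quantized):.2f}%")
```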

Reproduction

The AIME24 and MMLU_COT results were obtained with SGLang, while the GSM8K result was obtained with vLLM. All evaluations were run with a forked lm-evaluation-harness.

AIME24

# Launching server
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4-ASQ \
    --tp 8 \
    --trust-remote-code \
    --n-share-experts-fusion 8 \
    --disable-radix-cache

# Evaluating
lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
    --tasks aime24 \
    --num_fewshot 0 \
    --gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
    --batch_size auto \
    --log_samples \
    --output_path output_data/aime24 2>&1 | tee logs/aime24.log

MMLU_COT

# Launching server
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4-ASQ \
    --tp 8 \
    --trust-remote-code \
    --chunked-prefill-size 32768 \
    --mem-fraction-static 0.83

# Evaluating
lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
    --tasks mmlu_cot \
    --num_fewshot 0 \
    --gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
    --batch_size auto \
    --log_samples \
    --output_path output_data/mmlu_cot 2>&1 | tee logs/mmlu_cot.log

GSM8K

lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=8096 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto \
    --log_samples \
    --output_path output_data/gsm8k 2>&1 | tee logs/gsm8k.log