---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
- Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc
tags:
- mlc-llm
---

### WIP: will make sure it works and explain how to use it. I am working on Jetson devices

You need my patched version of MLC to run this model (it needs AWQ support for Qwen3): [My patch](https://github.com/corupta/jetson-containers-jp5/blob/master/packages/llm/mlc/patches/d2118b3.diff)

### If you have a Jetson Orin AGX, use `corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04` from Docker Hub

Verified to work in my case.

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

I also tried speculative decoding. First, run the draft model through sudonim once so it gets converted and its MLC config generated:

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0 \
      --quantization q4f16_0 \
      --chat-template deepseek_r1_qwen \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

Then fix the generated MLC config (`"context_window_size": 131072`, ... `"stop_token_ids": [0, 1]`, ... `"pad_token_id": 2`, `"bos_token_id": 0`, `"eos_token_id": 1`) and serve both models together:

```
docker run -it --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  mlc_llm serve --mode interactive --device cuda \
    --host 0.0.0.0 --port 9000 --overrides='gpu_memory_utilization=0.90' \
    --model-lib /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu128-sm87.so \
    /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
    --additional-models /root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC,/root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC/aarch64-cu128-sm87.so \
    --speculative-mode small_draft
```

However, the overhead was bigger than the gain: tokens were generated sometimes faster and sometimes slower (maybe a 20-50% hit rate on the speculative output; I didn't really record this ratio), and on average it came out at about the same speed, or perhaps 1% faster.
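Whichever way you launch it, the container exposes MLC-LLM's OpenAI-compatible REST API on port 9000, so you can smoke-test it with `curl`. A minimal sketch, assuming the served model is registered under the repo name used above (check `GET /v1/models` for the exact id):

```
# Smoke test against the OpenAI-compatible endpoint on port 9000.
# The "model" value is an assumption; query http://localhost:9000/v1/models
# to see the id the server actually registered.
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```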
### If you have a Jetson Xavier AGX, use `corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04` from Docker Hub

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  mlc:r35.6.1-cp312-cu124-22.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

## The JetPack 5 image is built with [corupta/jetson-containers-jp5](https://github.com/corupta/jetson-containers-jp5)

~~When running the model you might need to tweak `prefill_chunk` in sudonim or `prefill_chunk_size` in mlc-llm to fit the model within your memory constraints.~~

The model is based on [Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc](https://huggingface.co/Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc/blob/main/README.md) and was converted with the commands below (along with manual modifications to `mlc-chat-config.json`):

```
mlc_llm gen_config $LOCAL_MODEL_PATH \
  --quantization $QUANTIZATION \
  --conv-template $CONV_TEMPLATE \
  -o $MLC_MODEL_PATH

mlc_llm convert_weight $LOCAL_MODEL_PATH \
  --quantization $QUANTIZATION \
  -o $MLC_MODEL_PATH \
  --source-format awq \
  --source $LOCAL_MODEL_PATH/model.safetensors.index.json
```

From the original Intel model card:

```
This model is an int4 model with group_size 128 and symmetric quantization of
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B generated by intel/auto-round algorithm.
Please follow the license of the original model.
```
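For completeness, here is a sketch of how the environment variables in the conversion commands above might be filled in. The paths are illustrative, and `QUANTIZATION`/`CONV_TEMPLATE` are assumptions inferred from the `--quantization q4f16_awq` and `--chat-template deepseek_r1_qwen` options used elsewhere in this card, not a verbatim record of what was run:

```
# Illustrative values only; adjust paths to your own setup.
LOCAL_MODEL_PATH=/mnt/nvme/models/Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc    # local clone of the source AWQ model
MLC_MODEL_PATH=/mnt/nvme/models/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC  # output directory for the MLC weights
QUANTIZATION=q4f16_awq          # assumed: matches --quantization in the serve commands above
CONV_TEMPLATE=deepseek_r1_qwen  # assumed: matches --chat-template used with sudonim above
```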