---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
- Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc
tags:
- mlc-llm
---

### WIP: will make sure it works and explain how to use it. I am working on Jetson devices

You need my patched version of MLC to run this model (it needs AWQ support for Qwen3): [My patch](https://github.com/corupta/jetson-containers-jp5/blob/master/packages/llm/mlc/patches/d2118b3.diff)

### If you have a Jetson Orin AGX, use `corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04` from Docker Hub

Verified to work in my case.

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

I also tried speculative decoding. First, run the draft model through sudonim once so it gets converted and its MLC config generated:

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0 \
      --quantization q4f16_0 \
      --chat-template deepseek_r1_qwen \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

Then fix the generated MLC config (`"context_window_size": 131072`, ... `"stop_token_ids": [0, 1]`, ... `"pad_token_id": 2`, `"bos_token_id": 0`, `"eos_token_id": 1`) and serve both models together:

```
docker run -it --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  mlc_llm serve --mode interactive --device cuda \
    --host 0.0.0.0 --port 9000 --overrides='gpu_memory_utilization=0.90' \
    --model-lib /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu128-sm87.so \
    /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
    --additional-models /root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC,/root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC/aarch64-cu128-sm87.so \
    --speculative-mode small_draft
```

However, the overhead was bigger than the gain: tokens were generated sometimes faster and sometimes slower (maybe a 20-50% hit rate on the speculative output; I didn't really record this ratio), and on average it came out at about the same speed, or perhaps 1% faster.
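Whichever way you launch it, the container exposes MLC-LLM's OpenAI-compatible REST API on port 9000, so you can smoke-test it with `curl`. A minimal sketch, assuming the served model is registered under the repo name used above (check `GET /v1/models` for the exact id):

```
# Smoke test against the OpenAI-compatible endpoint on port 9000.
# The "model" value is an assumption; query http://localhost:9000/v1/models
# to see the id the server actually registered.
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```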
### If you have a Jetson Xavier AGX, use `corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04` from Docker Hub

```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  mlc:r35.6.1-cp312-cu124-22.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```

## The JetPack 5 image is built with [corupta/jetson-containers-jp5](https://github.com/corupta/jetson-containers-jp5)

~~When running the model you might need to tweak `prefill_chunk` in sudonim or `prefill_chunk_size` in mlc-llm to fit the model within your memory constraints.~~

The model is based on [Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc](https://huggingface.co/Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc/blob/main/README.md) and was converted with the commands below (along with manual modifications to `mlc-chat-config.json`):

```
mlc_llm gen_config $LOCAL_MODEL_PATH \
  --quantization $QUANTIZATION \
  --conv-template $CONV_TEMPLATE \
  -o $MLC_MODEL_PATH

mlc_llm convert_weight $LOCAL_MODEL_PATH \
  --quantization $QUANTIZATION \
  -o $MLC_MODEL_PATH \
  --source-format awq \
  --source $LOCAL_MODEL_PATH/model.safetensors.index.json
```

From the original Intel model card:

```
This model is an int4 model with group_size 128 and symmetric quantization of
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B generated by intel/auto-round algorithm.
Please follow the license of the original model.
```
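For completeness, here is a sketch of how the environment variables in the conversion commands above might be filled in. The paths are illustrative, and `QUANTIZATION`/`CONV_TEMPLATE` are assumptions inferred from the `--quantization q4f16_awq` and `--chat-template deepseek_r1_qwen` options used elsewhere in this card, not a verbatim record of what was run:

```
# Illustrative values only; adjust paths to your own setup.
LOCAL_MODEL_PATH=/mnt/nvme/models/Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc    # local clone of the source AWQ model
MLC_MODEL_PATH=/mnt/nvme/models/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC  # output directory for the MLC weights
QUANTIZATION=q4f16_awq          # assumed: matches --quantization in the serve commands above
CONV_TEMPLATE=deepseek_r1_qwen  # assumed: matches --chat-template used with sudonim above
```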