feihu.hf committed
Commit · a68211a
Parent(s): cba1e86
update README

README.md CHANGED
@@ -206,7 +206,14 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.

#### Step 1: Update Configuration File

-
+Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
+
+```bash
+export MODELNAME=Qwen3-235B-A22B-Instruct-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```

#### Step 2: Launch Model Server

@@ -226,7 +233,7 @@ Then launch the server with Dual Chunk Flash Attention enabled:

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
+vllm serve ./Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
@@ -262,7 +269,7 @@ Launch the server with DCA support:

```bash
python3 -m sglang.launch_server \
-  --model-path
+  --model-path ./Qwen3-235B-A22B-Instruct-2507 \
  --context-length 1010000 \
  --mem-frac 0.75 \
  --attention-backend dual_chunk_flash_attn \
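
Once either server from the diff above is running, it can be exercised through the OpenAI-compatible HTTP API that both `vllm serve` and `sglang.launch_server` expose. The request below is a minimal sketch, not part of the README change itself; it assumes vLLM's default port 8000 (SGLang defaults to 30000) and that the served model name defaults to the path passed on the command line:

```bash
# Sketch: smoke-test the OpenAI-compatible chat endpoint of the launched server.
# Port 8000 is vLLM's default; use 30000 for SGLang unless --port was changed.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Instruct-2507",
        "messages": [
          {"role": "user", "content": "Give a one-sentence summary of dual chunk attention."}
        ],
        "max_tokens": 128
      }'
```

A short prompt like this only confirms the server is reachable; exercising the long-context path means sending a prompt approaching the `--max-model-len 1010000` / `--context-length 1010000` limits configured above.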