Add sparse attention weights & update the relevant model card description
README.md CHANGED

@@ -106,6 +106,26 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --max-num-seqs 32
 ```
 
+To deploy the sparse attention version with FastDeploy to speed up long-context inference, you can run the following command.
+For more details about sparse attention, please refer to the [PLAS Attention](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/plas_attention.md) documentation.
+
+```bash
+export FD_ATTENTION_BACKEND="PLAS_ATTN"
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --quantization wint4 \
+    --tensor-parallel-size 4 \
+    --engine-worker-queue-port 8182 \
+    --max-model-len 131072 \
+    --max-num-seqs 32 \
+    --max-num-batched-tokens 8192 \
+    --enable-chunked-prefill \
+    --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
+```
+
 To deploy the W4A8C8 quantized version using FastDeploy, you can run the following command.
 
 ```bash
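Once the server started by the command above is running, the deployment can be exercised over the OpenAI-compatible HTTP interface that `fastdeploy.entrypoints.openai.api_server` exposes. Below is a minimal smoke-test sketch, assuming the server is reachable at `localhost:8180` as configured; the `/v1/chat/completions` route and request fields follow the OpenAI-compatible convention, and the prompt content is illustrative only.

```bash
# Minimal smoke test (sketch): send one chat request to the server deployed above.
# Assumes it is listening on localhost:8180 (--port 8180) and serves the
# OpenAI-compatible /v1/chat/completions route.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [
          {"role": "user", "content": "Summarize the key points of this text."}
        ],
        "max_tokens": 256
      }'
```

With the PLAS backend enabled, long inputs approaching the configured `--max-model-len` of 131072 tokens are where the sparse attention path should yield the largest speedup.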