Add sparse attention weights & update the relevant model card description
README.md CHANGED

@@ -106,6 +106,26 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --max-num-seqs 32
 ```
 
+To deploy the sparse attention version with FastDeploy to speed up long-context inference, you can run the following command.
+For more details about sparse attention, please refer to the [PLAS Attention](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/plas_attention.md) documentation.
+
+```bash
+export FD_ATTENTION_BACKEND="PLAS_ATTN"
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --quantization wint4 \
+    --tensor-parallel-size 4 \
+    --engine-worker-queue-port 8182 \
+    --max-model-len 131072 \
+    --max-num-seqs 32 \
+    --max-num-batched-tokens 8192 \
+    --enable-chunked-prefill \
+    --plas-attention-config '{"plas_encoder_top_k_left": 50, "plas_encoder_top_k_right": 60, "plas_decoder_top_k_left": 100, "plas_decoder_top_k_right": 120}'
+```
+
 To deploy the W4A8C8 quantized version using FastDeploy, you can run the following command.
 
 ```bash
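Once the server started by the command above is running, the deployment can be exercised over the OpenAI-compatible HTTP interface that `fastdeploy.entrypoints.openai.api_server` exposes. Below is a minimal smoke-test sketch, assuming the server is reachable at `localhost:8180` as configured; the `/v1/chat/completions` route and request fields follow the OpenAI-compatible convention, and the prompt content is illustrative only.

```bash
# Minimal smoke test (sketch): send one chat request to the server deployed above.
# Assumes it is listening on localhost:8180 (--port 8180) and serves the
# OpenAI-compatible /v1/chat/completions route.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [
          {"role": "user", "content": "Summarize the key points of this text."}
        ],
        "max_tokens": 256
      }'
```

With the PLAS backend enabled, long inputs approaching the configured `--max-model-len` of 131072 tokens are where the sparse attention path should yield the largest speedup.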