add vllm example in readme
README.md
## How to use in vLLM
The [PR](https://github.com/vllm-project/vllm/pull/27396) adding support for the Motif model in the official vLLM package is currently under review.
Alternatively, to use our model with vLLM without building from source, please use this prebuilt container [image](https://github.com/motiftechnologies/vllm/pkgs/container/vllm).
Our model supports a sequence length of up to 32K tokens.
```bash
# run the vLLM API server
VLLM_ATTENTION_BACKEND="DIFFERENTIAL_FLASH_ATTN" vllm serve Motif-Technologies/Motif-2-12.7B-Instruct --trust-remote-code --data-parallel-size <gpu_count>

# send a chat request with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital city of South Korea?"}
    ],
    "temperature": 0.6,
    "skip_special_tokens": false,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }'
```
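To check that the server is up before sending chat requests, you can hit the OpenAI-compatible model listing endpoint that vLLM exposes:

```bash
# list the models served by the running instance
curl http://localhost:8000/v1/models
```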