How do I use this correctly for online serving via the vLLM OpenAI-compatible server?
I'm using the command below, but I'm not sure whether it is set up correctly:
vllm serve deepseek-ai/DeepSeek-OCR \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
Then I call it this way:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
I am not sure whether the parameters I am passing actually have any effect. Can someone explain why these parameters are required and whether they are set up correctly?
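For reference, `message` above is a standard OpenAI-style multimodal message; I build it roughly like this (the file path, port, and prompt text here are placeholders, not exactly what I run):

import base64
from openai import OpenAI

# Placeholder endpoint; vllm serve listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder image path.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
            # Prompt text taken from the DeepSeek-OCR examples; adjust as needed.
            {"type": "text", "text": "<|grounding|>Convert the document to markdown."},
        ],
    }
]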
Corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
    --enable-log-requests \
    --gpu-memory-utilization 0.4 \
    --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja
inference:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            # "whitelist_token_ids": [128821, 128822],
        },
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
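For anyone wondering where `vllm_xargs` ends up: the OpenAI client merges `extra_body` keys into the top level of the request JSON, so the call above is equivalent to posting roughly this (the URL is a placeholder and `message` is the same list as in the snippet):

import requests

payload = {
    "model": "deepseek-ai/DeepSeek-OCR",
    "messages": message,  # same messages list as above
    "temperature": 0.0,
    "max_tokens": 500,
    "skip_special_tokens": False,
    # per-request arguments forwarded to NGramPerReqLogitsProcessor
    "vllm_xargs": {"ngram_size": 30, "window_size": 90},
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])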
@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?
What about the image sizes? Do we need to pass additional arguments to the API call, and if so, how? For example, how do you signal that you want the Gundam level of quality through this API call?
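Not a full answer, but for context: the DeepSeek-OCR model card documents Tiny (512×512, 64 vision tokens), Small (640×640, 100), Base (1024×1024, 256), Large (1280×1280, 400) and Gundam (dynamic tiling) modes. I have not confirmed how the vLLM server chooses among them for the OpenAI endpoint; the one thing you can control purely from the client side is the pixel size of the image you send, e.g. (a rough sketch, the helper name and target size are made up):

from io import BytesIO
import base64

from PIL import Image

def encode_image(path: str, long_side: int = 1024) -> str:
    # Hypothetical client-side resize before base64-encoding the image.
    img = Image.open(path).convert("RGB")
    scale = long_side / max(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")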
Same question here; I don't know how to choose the mode when using vLLM serve.