How to use it correctly with online serving via the vLLM OpenAI-compatible server?

#55
by dhruvil237 - opened

I'm using the command below, but I'm not sure if it's set up correctly:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor

then calling it this way:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    }
)

I am not sure whether the parameters I'm passing are having any effect.
Can someone explain why those parameters are required and whether they are set up correctly?

Corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --enable-log-requests --gpu-memory-utilization 0.4 --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja

Inference:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            # "whitelist_token_ids": [128821, 128822],
        },
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    }
)
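
For reference, both snippets above assume an OpenAI client pointed at the local vLLM server and a message payload that carries the page image. Below is a minimal sketch of that setup; the port, file name, and prompt wording are assumptions rather than details from the original post (the prompt follows the markdown-conversion example from the model card):

import base64
from openai import OpenAI

# Standard OpenAI SDK pointed at the local vLLM OpenAI-compatible server.
# vLLM ignores the api_key value, but the SDK requires one to be set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local page image as a data URL so it can be embedded in the request.
with open("page.png", "rb") as f:  # hypothetical input file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "<|grounding|>Convert the document to markdown."},
        ],
    }
]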

@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?

What about the image sizes? Do we need to pass additional arguments to the API call, and if so, how? For example, how do you request the Gundam level of quality through this API call?

Same question here; I don't know how to choose the model when using vLLM serve.
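
If the confusion is about which model string to pass from the client, the OpenAI-compatible server lists whatever it is serving at the standard models endpoint, so you can query it instead of guessing. A quick check, assuming the default local port:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for m in client.models.list():
    # Prints the served model name, e.g. "deepseek-ai/DeepSeek-OCR",
    # unless the server was started with --served-model-name.
    print(m.id)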
