No output / Repeated outputs when using Gemma 3 12B/27B on vLLM
I have hosted Gemma 3 27B and 12B on 4 L4 GPUs using vLLM and I am trying to translate a few documents from English to Indic languages. However, I am either getting no output in the target language or getting repetitions of the English input. The vLLM serve command for these models is below. I tried sarvam-translate with the exact same settings and it works out of the box.
I have tried tweaking the generation parameters and even tried smaller sentences, but it does not work. Am I missing something here?
This is my vLLM serve command:
vllm serve google/gemma-3-12b-it \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --port 8000 \
  --max-model-len 8192 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9
Vanilla client code that I have been trying:
# Modify OpenAI's API key and API base to use vLLM's API server.
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
tgt_lang = 'Hindi'
input_txt = 'Be the change you wish to see in the world.'
messages = [{"role": "system", "content": f"Translate the text below to {tgt_lang}."}, {"role": "user", "content": input_txt}]
response = client.chat.completions.create(model=model, messages=messages, temperature=0.01)
output_text = response.choices[0].message.content
print("Input:", input_txt)
print("Translation:", output_text)```
I have this problem.
Having the same issue. Hope someone from Google replies soon.
Me too. Image-to-text works fine for me, though.
The issue went away after I switched to the latest image:
inferenceservice:
  predictor:
    containers:
      - name: kserve-container
        imageURL: vllm/vllm-openai:v0.10.0
        args:
          - --model=google/gemma-3-27b-it
          - --tokenizer=google/gemma-3-27b-it
          - --tensor-parallel-size=8
          - "--gpu-memory-utilization=0.9"
          - "--max-model-len=8192"
          - "--trust-remote-code"
          - "--enforce-eager"
Hi,
Apologies for the late reply. The core problem is likely that the model is not interpreting the prompt as a translation task. Your vllm serve command loads google/gemma-3-12b-it, which is instruction-tuned, but it may not respond well to a generic instruction like "Translate the text below."
The standard prompt format for Gemma 3 is as follows:
"< start_of_turn >user
[ your prompt here ]< end_of_turn >
< start_of_turn >model"
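For reference, here is a minimal sketch of how this raw turn format can be sent through vLLM's OpenAI-compatible completions endpoint to see exactly what the model receives. The base URL, model name, and sampling parameters are assumptions copied from the serve command earlier in this thread. Note that when you use the chat.completions endpoint, vLLM applies the model's chat template automatically, so this is mainly useful for troubleshooting.

```python
# Minimal sketch (assumptions: server at localhost:8000, model google/gemma-3-12b-it).
# Sends a raw Gemma 3-formatted prompt through the plain completions endpoint.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

prompt = (
    "<start_of_turn>user\n"
    "Translate the following sentence to Hindi:\n"
    "Be the change you wish to see in the world.<end_of_turn>\n"
    "<start_of_turn>model\n"
)

response = client.completions.create(
    model="google/gemma-3-12b-it",
    prompt=prompt,
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].text)
```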
Suggested Fixes:
1. Use a more detailed prompt. Provide more context and specific instructions to the model. A zero-shot prompt may not be sufficient for a complex task like translation, especially for Indic languages where the model may have seen less training data.
2. Try a few-shot prompt. Provide a few examples of English-to-Hindi translations in the prompt (see the sketch after this list). Showing the model the exact format and type of output you expect can significantly improve performance, and this is a common and effective technique for translation.
3. Use a specifically fine-tuned model. google/gemma-3-12b-it is a general instruction-tuned model. If you are doing a high volume of translations, consider using or fine-tuning a model dedicated to this purpose; a model fine-tuned for Indic languages will likely outperform a general-purpose one.
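To make the few-shot suggestion concrete, here is a minimal sketch that passes the example translations as prior chat turns. The example pairs are simple illustrative placeholders rather than data from this thread, and the connection settings are again assumed from the serve command above.

```python
# Few-shot sketch: earlier user/assistant turns act as translation examples.
# The example pairs below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [
    {"role": "system", "content": "Translate the user's text from English to Hindi. Reply with the translation only."},
    # Few-shot examples shown as previous turns
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "नमस्ते।"},
    {"role": "user", "content": "Thank you."},
    {"role": "assistant", "content": "धन्यवाद।"},
    # Actual input to translate
    {"role": "user", "content": "Be the change you wish to see in the world."},
]

response = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=messages,
    temperature=0.01,
    max_tokens=256,
)
print(response.choices[0].message.content)
```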
Thanks.
