No output / Repeated outputs when using Gemma 3 12B/27B on vLLM

#79
by sanchitahuja205 - opened

I have hosted Gemma 3 27B and 12B on 4 L4 GPUs using vLLM, and I am trying to translate a few documents from English to Indic languages. However, I either get no output in the target language or get repetitions in English. The vLLM serve command for these models is below. I tried sarvam-translate with the exact same settings and it works out of the box.
I have tried adjusting the generation parameters and even tried smaller sentences, but it does not work. Am I missing something here?
This is my vLLM serve command:

```shell
vllm serve google/gemma-3-12b-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --port 8000 \
    --max-model-len 8192 \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.9
```

Vanilla client code that I have been trying:


```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

tgt_lang = "Hindi"
input_txt = "Be the change you wish to see in the world."
messages = [
    {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
    {"role": "user", "content": input_txt},
]

response = client.chat.completions.create(model=model, messages=messages, temperature=0.01)
output_text = response.choices[0].message.content

print("Input:", input_txt)
print("Translation:", output_text)
```

I have this problem.

Having the same issue. Hope someone at Google replies soon.

Me too. For image-to-text, it is fine.

The issue is gone after I switched to the latest image:

```yaml
inferenceservice:
  predictor:
    containers:
      - name: kserve-container
        imageURL: vllm/vllm-openai:v0.10.0
        args:
          - --model=google/gemma-3-27b-it
          - --tokenizer=google/gemma-3-27b-it
          - --tensor-parallel-size=8
          - "--gpu-memory-utilization=0.9"
          - "--max-model-len=8192"
          - "--trust-remote-code"
          - "--enforce-eager"
```

Hi,

Apologies for the late reply. The core problem is likely that the model is not interpreting the prompt as a translation task. The model you are serving, google/gemma-3-12b-it, is instruction-tuned, but it may not respond well to a generic instruction like "Translate the text below."

The standard prompt format for Gemma 3 is as follows:

```
<start_of_turn>user
[ your prompt here ]<end_of_turn>
<start_of_turn>model
```
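In practice the tokenizer's chat template renders this structure for you, but a small helper makes the expected format visible for debugging (the function below is an illustrative sketch, not part of the vLLM or transformers APIs):

```python
def format_gemma_prompt(user_text: str) -> str:
    """Render a single-turn prompt in Gemma 3's expected turn structure.

    Normally the tokenizer's chat template does this; this helper only
    exists to show what the model should see after templating.
    """
    return (
        f"<start_of_turn>user\n"
        f"{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

prompt = format_gemma_prompt(
    "Translate to Hindi: Be the change you wish to see in the world."
)
print(prompt)
```

Comparing this against what your serving stack actually sends is a quick way to spot a missing or overridden chat template.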

Suggested fixes:

1. Use a more detailed prompt. Provide more context and specific instructions to the model. A zero-shot prompt may not be sufficient for a complex task like translation, especially for Indic languages, where the model may have seen less training data.

2. Try a few-shot prompt. Provide a few examples of English-to-Hindi translations in the prompt. This can significantly improve performance by showing the model the exact format and type of output you expect, and is a common, effective technique for translation tasks.

3. Use a specific fine-tuned model. google/gemma-3-12b-it is a general instruction-tuned model. If you are doing a high volume of translations, consider using or fine-tuning a model specialized for this purpose; a model fine-tuned for Indic languages will likely outperform a general-purpose one.
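The few-shot suggestion above can be sketched as a message list for the OpenAI-compatible endpoint. The example translation pairs here are illustrative, not from the original post:

```python
tgt_lang = "Hindi"

# Worked examples presented as prior user/assistant turns,
# followed by the actual sentence to translate.
few_shot_pairs = [
    ("Hello, how are you?", "नमस्ते, आप कैसे हैं?"),
    ("Thank you very much.", "बहुत बहुत धन्यवाद।"),
]

messages = [
    {
        "role": "system",
        "content": f"Translate the user's text to {tgt_lang}. Reply with the translation only.",
    }
]
for src, tgt in few_shot_pairs:
    messages.append({"role": "user", "content": src})
    messages.append({"role": "assistant", "content": tgt})
messages.append(
    {"role": "user", "content": "Be the change you wish to see in the world."}
)

# messages can now be passed to client.chat.completions.create(...)
```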

Thanks.

This is a known issue with Gemma 3 models on vLLM that stems from a few interacting problems. The most common culprit is the RoPE scaling configuration: google/gemma-3-27b-it uses an attention implementation that older vLLM releases don't handle correctly, leading to degenerate repetition loops or empty outputs. Make sure you're on a recent vLLM build and explicitly pass --max-model-len with something reasonable (e.g. 8192 or 16384) rather than letting it auto-detect, since the model's declared context length can cause memory allocation issues that silently corrupt generation.

The other thing worth checking is your sampling parameters. Gemma 3 at the 27B scale is particularly sensitive to temperature/top-p combinations when used with vLLM's continuous batching. A temperature of 0 (greedy) sometimes surfaces the repetition bug more aggressively than sampling-based decoding. Try temperature=0.7, top_p=0.9, repetition_penalty=1.1 as a baseline. Also verify your chat template is being applied correctly — if you're passing raw strings instead of using the tokenizer's apply_chat_template, the model will often produce garbage or repeat the prompt back because it's not seeing the expected <start_of_turn> / <end_of_turn> control tokens.
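With the OpenAI Python client, that sampling baseline can be passed to a vLLM server roughly as below. Note that repetition_penalty is a vLLM extension, not part of the OpenAI API schema, so it goes through the client's extra_body; treat this as a sketch:

```python
# Baseline sampling parameters suggested above. repetition_penalty is
# not an OpenAI API field, so it is sent via the request's extra body,
# which vLLM's OpenAI-compatible server reads as an extra sampling param.
sampling_kwargs = {
    "temperature": 0.7,
    "top_p": 0.9,
    "extra_body": {"repetition_penalty": 1.1},
}

# Usage against a running vLLM server:
# response = client.chat.completions.create(
#     model=model, messages=messages, **sampling_kwargs
# )
```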

One broader note: if you're running gemma-3-27b-it inside a multi-agent pipeline, these silent failure modes (empty output, repetition) are particularly dangerous because downstream agents will either stall or propagate bad state. This is actually a problem we've been thinking about at AgentGraph — reliably detecting when a model node is in a degenerate output state versus legitimately producing a short response is a non-trivial trust signal. Wrapping your vLLM calls with output validation and a fallback retry with adjusted sampling params is a reasonable stopgap until the underlying vLLM issue is resolved.
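A minimal sketch of that validate-and-retry stopgap, assuming a callable that takes a temperature and returns generated text. The repetition heuristic and retry schedule here are illustrative, not a definitive implementation:

```python
def looks_degenerate(text, min_len=1, max_ngram_repeats=4):
    """Heuristic: flag empty output or a short phrase repeated back-to-back."""
    if text is None or len(text.strip()) < min_len:
        return True
    words = text.split()
    # Check whether a short trailing n-gram repeats many times in a row.
    for n in (1, 2, 3):
        if len(words) >= n * max_ngram_repeats:
            tail = words[-n:]
            repeats = 0
            i = len(words) - n
            while i >= 0 and words[i:i + n] == tail:
                repeats += 1
                i -= n
            if repeats >= max_ngram_repeats:
                return True
    return False

def generate_with_retry(generate, temperatures=(0.01, 0.7, 1.0)):
    """Call generate(temperature); retry with adjusted sampling on bad output."""
    last = ""
    for temp in temperatures:
        last = generate(temp)
        if not looks_degenerate(last):
            return last
    return last  # still degenerate: caller decides how to handle it
```

Here `generate` would wrap the actual `client.chat.completions.create` call with the given temperature.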

The repeated/no output issue with google/gemma-3-27b-it on vLLM is almost certainly a sampling parameter problem. Gemma 3 models are particularly sensitive to temperature and repetition penalty settings — if you're running with temperature=0 or very low values alongside greedy decoding, you can hit degenerate loops especially on longer context windows. Try setting temperature=0.7, top_p=0.9, and explicitly setting repetition_penalty=1.1 or higher. Also worth checking your vLLM version — there were known issues with sliding window attention handling in older releases that affected Gemma 3's 128k context support, causing silent truncation or malformed KV cache states that manifest as empty or repeated outputs.

Another thing to verify: the chat_template being applied. The gemma-3-27b-it instruction-tuned variant expects a specific turn structure with <start_of_turn> and <end_of_turn> tokens. If vLLM is not correctly applying the tokenizer's built-in chat template (check tokenizer.chat_template is not being overridden), the model receives malformed input and will either output nothing or loop on the EOS token boundary. You can debug this by logging the raw token IDs being passed in and verifying the template matches what's documented in the model card.

On the infrastructure side — if you're deploying this in a multi-agent setup where multiple agents are calling the same vLLM endpoint concurrently, race conditions in request batching can also produce these symptoms. We've seen similar issues in AgentGraph deployments where agents share an inference backend and poorly timed concurrent requests corrupt batch state. Isolating whether the issue is reproducible with single sequential requests vs. concurrent ones is a useful diagnostic step to narrow down whether it's a model/sampling issue or a serving infrastructure issue.
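One way to sketch that sequential-vs-concurrent diagnostic, with the actual request function stubbed out (swap in a real call to the vLLM endpoint for `request_fn`):

```python
from concurrent.futures import ThreadPoolExecutor

def run_diagnostic(request_fn, prompts, workers=4):
    """Compare sequential vs. concurrent results for the same prompts.

    request_fn(prompt) -> str is a placeholder for your real vLLM call.
    If sequential outputs look fine but concurrent ones degrade, suspect
    the serving layer rather than the model or sampling parameters.
    """
    sequential = [request_fn(p) for p in prompts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        concurrent = list(pool.map(request_fn, prompts))
    mismatches = [
        p for p, s, c in zip(prompts, sequential, concurrent) if s != c
    ]
    return sequential, concurrent, mismatches
```

With greedy decoding the two passes should agree; with sampling enabled, compare for degenerate outputs rather than exact equality.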
