Warning "It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`"

by jesusgs01 - opened Mar 29, 2025

Discussion

jesusgs01

Mar 29, 2025 (edited Mar 29, 2025)

Hello,

I am trying to fine-tune PaliGemma 2. I have the following code:

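import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"

# The original snippet ended in a bare `return`, so it is wrapped in a
# function here; the function name is illustrative.
def load_model_and_processor():
    # Load the weights in bfloat16 and let Accelerate place them on the available devices.
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )

    # Pad on the left and reuse the EOS token as the padding token.
    processor = AutoProcessor.from_pretrained(model_id)
    processor.tokenizer.padding_side = "left"
    processor.tokenizer.pad_token = processor.tokenizer.eos_token

    return model, processor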

When I start training, I get the following warning:

It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('', attn_implementation='eager')`.

I assume this might impact the model's performance and convergence. Does anyone know how to properly change this setting?

Thanks in advance!

GopiUppari

Google org Apr 10, 2025

Hi @jesusgs01 ,

Using eager attention can help avoid NaN issues during training, but it might cause the model to converge faster, which sometimes leads to overfitting. On the other hand, flash_attention_2 (or sdpa) may train more slowly but tends to generalize better.

To switch between these modes, you can set the attn_implementation parameter when loading the model with PaliGemmaForConditionalGeneration.from_pretrained().

Use attn_implementation="eager" for eager attention
Use attn_implementation="sdpa" to enable SDPA (scaled dot product attention)
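
As a rough illustration of the two options above, the sketch below passes attn_implementation through to PaliGemmaForConditionalGeneration.from_pretrained(), reusing the loading arguments from the original question. The helper names build_load_kwargs and load_paligemma and the SUPPORTED_ATTN tuple are hypothetical and only there to keep the example tidy; the "flash_attention_2" value is assumed to require the separate flash-attn package.

```python
# Attention backends mentioned in this thread. "flash_attention_2" is assumed
# to need the optional flash-attn package to be installed.
SUPPORTED_ATTN = ("eager", "sdpa", "flash_attention_2")


def build_load_kwargs(attn_implementation="eager"):
    """Return from_pretrained keyword arguments for the chosen attention backend.

    Hypothetical helper: it only validates the name and assembles a dict.
    """
    if attn_implementation not in SUPPORTED_ATTN:
        raise ValueError(
            f"Unknown attn_implementation {attn_implementation!r}; "
            f"expected one of {SUPPORTED_ATTN}"
        )
    return {
        "low_cpu_mem_usage": True,
        "device_map": "auto",
        "attn_implementation": attn_implementation,
    }


def load_paligemma(model_id="google/paligemma2-3b-pt-224",
                   attn_implementation="eager"):
    """Load the model and processor with the requested attention backend."""
    # Heavy imports stay local so build_load_kwargs can be used without them.
    import torch
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(attn_implementation),
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Calling load_paligemma(attn_implementation="eager") before training should make the warning go away, since it is only emitted when a non-eager implementation is active; load_paligemma(attn_implementation="sdpa") restores the previous behavior.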

For more details, could you please check this reference.

Thank you.
