Warning "It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`"
Hello,
I am trying to fine-tune PaliGemma 2. I have the following code:
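import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

def load_model_and_processor():  # wrapper name assumed; the original post only showed the body
    model_id = "google/paligemma2-3b-pt-224"
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    # Left padding, with the EOS token reused as the pad token
    processor.tokenizer.padding_side = "left"
    processor.tokenizer.pad_token = processor.tokenizer.eos_token
    return model, processor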
When I start my training, I get the following warning:
It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('', attn_implementation='eager')`.
I assume this might impact the model's performance and convergence. Does anyone know how to properly change this setting?
Thanks in advance!
Hi @jesusgs01 ,
Using eager attention can help avoid NaN issues during training, but it might cause the model to converge faster, which sometimes leads to overfitting. By contrast, flash_attention_2 (or sdpa) may train more slowly but tends to generalize better.
To switch between these modes, set the `attn_implementation` parameter when loading the model with `PaliGemmaForConditionalGeneration.from_pretrained()`:
- Use `attn_implementation="eager"` for eager attention
- Use `attn_implementation="sdpa"` to enable SDPA (scaled dot-product attention)
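As a minimal sketch of the options above, the loading code from the question could pass `attn_implementation` like this. The helper names `build_load_kwargs` and `load_paligemma` are hypothetical and exist only for illustration; the other arguments are copied from the original post. `torch` and `transformers` are imported lazily so that nothing is downloaded until `load_paligemma` is actually called.

```python
# Hypothetical helpers for illustration; not part of transformers.
ALLOWED_ATTN = {"eager", "sdpa", "flash_attention_2"}


def build_load_kwargs(attn_implementation="eager"):
    """Return keyword arguments for from_pretrained with the chosen attention mode."""
    if attn_implementation not in ALLOWED_ATTN:
        raise ValueError(
            f"attn_implementation must be one of {sorted(ALLOWED_ATTN)}, "
            f"got {attn_implementation!r}"
        )
    return {
        "attn_implementation": attn_implementation,
        "low_cpu_mem_usage": True,
        "device_map": "auto",
    }


def load_paligemma(model_id="google/paligemma2-3b-pt-224", attn_implementation="eager"):
    """Load the model and processor; defaults to eager attention as the warning suggests."""
    # Imported here so the module can be imported without these packages installed.
    import torch
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(attn_implementation),
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor


# Example (downloads the checkpoint, so it is left commented out):
# model, processor = load_paligemma(attn_implementation="eager")
```

Loading with `"eager"` should make the training warning go away, since the warning is only raised when a different implementation is active.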
For more details, please check this reference.
Thank you.