Training version 896
Has anyone managed to train this model yet? Due to the large image token sequence length, it requires using a sharding strategy. I set up all my code with Lightning rather than HF and accelerate (which was probably a mistake) and am still unable to get it to train - I am running into errors with FSDP. Wondering if anyone has managed to train the model.
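To put a rough number on the "large image token sequence length", here is a back-of-the-envelope sketch. It assumes 14-pixel square patches and Gemma-2B-like decoder dimensions (18 layers, 8 attention heads); none of these figures are stated in the thread, so check them against the model config. It also assumes an unfused attention implementation that keeps the full score matrix for the backward pass, and it ignores text tokens and every other activation.

```python
# Assumptions (not from the thread): 14-pixel patches, 18 decoder layers,
# 8 attention heads, fp32 activations, unfused (eager) attention.
IMAGE_SIZE = 896
PATCH_SIZE = 14
NUM_LAYERS = 18
NUM_HEADS = 8
BYTES_FP32 = 4


def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens produced for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def attention_scores_bytes(seq_len: int, num_heads: int, num_layers: int,
                           bytes_per_el: int, batch_size: int = 1) -> int:
    """Bytes needed to keep every layer's seq_len x seq_len score matrix."""
    return batch_size * num_layers * num_heads * seq_len * seq_len * bytes_per_el


tokens = image_token_count(IMAGE_SIZE, PATCH_SIZE)
scores_gib = attention_scores_bytes(tokens, NUM_HEADS, NUM_LAYERS, BYTES_FP32) / 2**30

print(f"image tokens per sample: {tokens}")
print(f"attention scores kept for backward: {scores_gib:.1f} GiB per sample")
```

Under these assumptions the attention scores alone grow quadratically with the token count, which is one reason a single GPU runs out of memory quickly at this resolution.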
If this is of any help to others: using Lightning and FSDP with 8 A100-80GB GPUs, I am now able to train it with a batch size of 2 in fp32. I'll try to release my fine-tuning code soon. The key here is activation checkpointing, as it's the activations that take up the vast majority of VRAM due to the large sequence length.
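For readers unfamiliar with activation checkpointing, a minimal sketch in plain PyTorch follows. It is not the poster's Lightning/FSDP code, which has not been shared here; `Block` and `TinyStack` are hypothetical stand-ins for transformer layers. The idea is that each checkpointed block discards its intermediate activations in the forward pass and recomputes them during backward, trading extra compute for lower peak memory.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical stand-in for one transformer layer (residual MLP)."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


class TinyStack(nn.Module):
    """Stack of blocks with optional per-block activation checkpointing."""

    def __init__(self, dim: int = 32, depth: int = 4, use_checkpointing: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.use_checkpointing = use_checkpointing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Intermediates inside `block` are not stored; they are
                # recomputed when backward reaches this block.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def grad_of_first_weight(use_checkpointing: bool) -> torch.Tensor:
    """Run one forward/backward pass and return a gradient for comparison."""
    torch.manual_seed(0)  # same weights and inputs for both runs
    model = TinyStack(use_checkpointing=use_checkpointing).train()
    x = torch.randn(2, 16, 32)
    model(x).sum().backward()
    return model.blocks[0].ff[0].weight.grad.clone()


# Checkpointing changes memory use, not the computed gradients.
grads_match = torch.allclose(grad_of_first_weight(True), grad_of_first_weight(False))
print("gradients match:", grads_match)
```

In a real fine-tuning setup the checkpointing would normally be applied per decoder layer through the training framework's own mechanism rather than by hand, and the memory saved depends on the model and attention implementation.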
Hi @kwin-sustainment, I'm quite busy finishing a project at the moment and was planning a release in around two weeks' time. However, if it's urgent for you, I can send over the main elements.
@lcolonn hey no worries! I actually got it working after I delved into your hint about that checkpointing & VRAM. Thank you for the advice :)
Edit: For people looking at this thread in the future, I think you need a MINIMUM of a 24GB card to train this model. It would be a batch size of one and take forever, but it'd work...
Hi @kwin-sustainment, I hope the issue has been resolved. Feel free to close this issue, and please let us know if any further assistance is needed. Thanks!