PaliGemma 3B PT 224
Google PaliGemma Model Summary
Introduction
PaliGemma is a versatile vision-language model (VLM) developed by Google that takes both images and text as input and generates text as output. Inspired by the PaLI-3 model, it combines components from the SigLIP vision model and the Gemma language model, and is designed for tasks such as image captioning, visual question answering, and object detection.
Architecture
PaliGemma pairs a Vision Transformer image encoder with a Transformer text decoder, totaling 3 billion parameters. The text decoder is initialized from Gemma-2B, and the image encoder from SigLIP-So400m/14; training follows the PaLI-3 recipe.
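For a quick sanity check, both halves of the architecture are visible in the model configuration that transformers exposes. This is a minimal sketch assuming the current PaliGemmaConfig layout (vision_config and text_config attributes); names may differ across library versions.

from transformers import AutoConfig

# Inspect the two components of the checkpoint's configuration
config = AutoConfig.from_pretrained("google/paligemma-3b-pt-224")
print(type(config.vision_config).__name__)  # SigLIP vision encoder config
print(type(config.text_config).__name__)    # Gemma decoder config
print(config.vision_config.image_size)      # 224 for this checkpoint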
Training
PaliGemma is pre-trained on a mixture of datasets including WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT. Data-responsibility filtering is applied to the pre-training mixture to remove content deemed pornographic, unsafe, toxic, or personally identifying. Training runs on TPU hardware using JAX and Flax.
Guide: Running Locally
- Install dependencies: ensure you have transformers, torch, and PIL, and optionally bitsandbytes for 8-bit precision (see the sketch after this list).
- Load the model: use the PaliGemmaForConditionalGeneration class from transformers.
- Prepare inputs: load an image and create a text prompt.
- Run inference: use the model to generate text from the provided inputs.
- Using GPUs: for performance, consider using cloud GPUs such as those offered by AWS or Google Cloud.
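If memory is tight, 8-bit loading via bitsandbytes can reduce the footprint. This is a minimal sketch, assuming the BitsAndBytesConfig quantization API in recent transformers releases; it is not part of the original card's example.

from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Sketch: load the model in 8-bit precision (requires the optional bitsandbytes package)
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",
    quantization_config=quant_config,
    device_map="auto",  # place layers on available devices automatically
).eval()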
Example Code
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0" # Use GPU if available
dtype = torch.bfloat16 # Adjust precision if needed
# Load image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
# Load model and processor
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype, device_map=device).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Prepare input
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
# Generate text
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
decoded = processor.decode(generation[0][model_inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)
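The same pipeline accepts other task prefixes. The examples below follow Google's published prompt format for the mix checkpoints; treat them as a hedged sketch rather than an exhaustive list.

# Assumed task prefixes for the mix checkpoints (per the PaliGemma docs):
prompt = "caption en"                        # caption in English
prompt = "answer en what color is the car?"  # visual question answering
prompt = "detect car"                        # bounding boxes as <loc...> tokens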
License
The PaliGemma model is released under the Gemma license, requiring users to review and agree to Google's usage license. Access is provided through Hugging Face upon acknowledgment of the license terms.