PaliGemma 3B PT 224

Google

PaliGemma Model Summary

Introduction

PaliGemma is a versatile vision-language model (VLM) from Google that takes images and text as input and generates text as output. Inspired by PaLI-3, it combines the SigLIP vision model with the Gemma language model and is designed for tasks such as image captioning, visual question answering, and object detection.

Architecture

PaliGemma pairs a Vision Transformer encoder with a Transformer decoder, for a total of about 3 billion parameters. The image encoder is initialized from SigLIP-So400m/14 and the text decoder from Gemma-2B; training follows the PaLI-3 recipe.
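
The encoder/decoder split is visible in the checkpoint's configuration. Below is a minimal inspection sketch, assuming the transformers PaliGemmaConfig exposes vision_config and text_config sub-configurations (the attribute names come from the transformers API, not this card):

from transformers import AutoConfig

# Inspect how the two towers are configured (the checkpoint is gated; requires license acceptance)
config = AutoConfig.from_pretrained("google/paligemma-3b-pt-224")
print(type(config).__name__)              # expected: PaliGemmaConfig
print(config.vision_config.model_type)    # expected: the SigLIP vision tower
print(config.text_config.model_type)      # expected: the Gemma text decoder
print(config.vision_config.num_hidden_layers, config.text_config.num_hidden_layers)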

Training

PaliGemma is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT. The mixture is filtered for data responsibility: content deemed pornographic, unsafe, toxic, or personally identifying is removed. Training runs on TPU hardware using JAX and Flax.

Guide: Running Locally

  1. Install dependencies: transformers, torch, pillow, and requests (pip install transformers torch pillow requests); optionally bitsandbytes and accelerate for 8-bit loading (see the sketch after this list).
  2. Load the model: Use the PaliGemmaForConditionalGeneration class from transformers.
  3. Prepare inputs: Load an image and create a text prompt.
  4. Run inference: Use the model to generate text based on the provided inputs.
  5. Using GPUs: For performance, consider using cloud GPUs such as those offered by AWS or Google Cloud.
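
For the optional bitsandbytes route in step 1, the following is a minimal 8-bit loading sketch. It assumes the standard BitsAndBytesConfig API in transformers plus an installed accelerate; treat it as a sketch rather than verified output of this card:

from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Sketch: load the checkpoint with 8-bit weights to reduce GPU memory use
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",      # same checkpoint as the example below
    quantization_config=quant_config,
    device_map="auto",                  # let accelerate place layers automatically
).eval()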

Example Code

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/paligemma-3b-mix-224"  # "mix" checkpoint fine-tuned on a task mixture; swap in "google/paligemma-3b-pt-224" for the raw pre-trained weights
device = "cuda:0"  # use the GPU if available
dtype = torch.bfloat16  # adjust precision if needed (e.g. torch.float16 on GPUs without bfloat16 support)

# Load image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Load model and processor
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype, device_map=device).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Prepare input: the task prefix "caption <lang>" requests a caption in that language (Spanish here)
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate text
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    decoded = processor.decode(generation[0][model_inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print(decoded)
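
The other tasks named in the introduction use the same call pattern with different task prefixes. Reusing the model, processor, and image loaded above, here is a hedged sketch following the prompt conventions published on the PaliGemma model card ("answer <lang> <question>" for visual question answering, "detect <object>" for detection); outputs are not verified here:

# Visual question answering: prefix "answer <lang>" followed by the question
vqa_inputs = processor(text="answer en what color is the car?", images=image, return_tensors="pt").to(model.device)

# Object detection: prefix "detect <object>"; the model emits <loc...> location tokens
det_inputs = processor(text="detect car", images=image, return_tensors="pt").to(model.device)

for inputs in (vqa_inputs, det_inputs):
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))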

License

PaliGemma is released under the Gemma license. Users must review and agree to Google's usage terms; access through Hugging Face is granted once the license is acknowledged.
