paligemma2 3b pt 448

google

Introduction

PaliGemma 2 is a vision-language model (VLM) that takes images and text as input and generates text as output. It updates the original PaliGemma model by building on the Gemma 2 language models, and it targets tasks such as image and video captioning, visual question answering, and object detection.

Architecture

PaliGemma 2 combines a Transformer text decoder with a Vision Transformer image encoder. The text decoder is initialized from Gemma 2, and the image encoder is SigLIP-So400m/14. This checkpoint is a pre-trained model intended for fine-tuning, and it takes input images of 448x448 pixels and text sequences of up to 512 tokens. The model can be loaded in bfloat16, which reduces memory use compared with float32.
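To make the input configuration concrete, the number of image tokens can be worked out from the resolution and the patch size. The sketch below assumes that the "/14" in SigLIP-So400m/14 denotes a 14x14 pixel patch and that each patch becomes exactly one token (no pooling or extra class token). The function name image_token_count is illustrative and not part of any library.

```python
def image_token_count(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size  # patches along one edge
    return per_side * per_side


# Assumed values: 448 px input from this checkpoint, 14 px patches from "/14".
tokens_448 = image_token_count(448, 14)
print(tokens_448)  # 1024
```

Under these assumptions, a 448x448 image is represented by 1024 tokens (a 32x32 grid of patches) that the decoder sees together with the text prompt.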

Training

PaliGemma 2 is pre-trained on a mixture of datasets that includes WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT. The training data is filtered for safety and responsible use. Training runs on TPUv5e hardware with a software stack built on JAX, Flax, and TFDS.

Guide: Running Locally

To run PaliGemma 2 locally, follow these steps:

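  1. Install Dependencies: Install Python together with the transformers, torch, and accelerate packages (accelerate is required for device_map="auto").
  2. Load Model: Use the PaliGemmaForConditionalGeneration and PaliGemmaProcessor classes to load the model and processor.
  3. Prepare Inputs: Use load_image to fetch your image, and set the text prompt (the example leaves it empty for this pre-trained checkpoint).
  4. Generate Output: Run the processor on the inputs, generate token IDs with the model, and decode only the newly generated tokens.

from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-3b-pt-448"

# Replace the placeholder with the URL or local path of your image.
image = load_image("your_image_url")

# Load the weights in bfloat16 and let accelerate place them on the available device(s).
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# The prompt is left empty for the pre-trained checkpoint.
prompt = ""
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)

# Length of the input sequence, used to strip the prompt from the output.
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    # Greedy decoding, capped at 100 new tokens.
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Keep only the tokens generated after the input, then decode them to text.
    decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
    print(decoded)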
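Since object detection is one of the listed tasks, it can help to see how a detection-style output string might be turned into pixel boxes. The sketch below is an assumption, not something stated in this document: it follows the convention described in the PaliGemma documentation, where each detected object is written as four <locNNNN> tokens in the order y_min, x_min, y_max, x_max (each binned into 0..1023) followed by a label, with objects separated by ";". The function parse_detections and the sample string are hypothetical, and a fine-tuned checkpoint may use a different output format.

```python
import re

# Assumed format: <locYMIN><locXMIN><locYMAX><locXMAX> label ; ...
# Each coordinate is a 4-digit bin index in the range 0..1023.
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text, width, height):
    """Convert a detection string into pixel boxes (x_min, y_min, x_max, y_max)."""
    boxes = []
    for match in _DETECTION.finditer(text):
        # Normalize each bin index to the [0, 1) range.
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        boxes.append({
            "label": match.group(5).strip(),
            "box": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return boxes


# Hypothetical model output for a 448x448 image.
sample = "<loc0256><loc0128><loc0768><loc0896> car ; <loc0000><loc0000><loc0512><loc0512> tree"
result = parse_detections(sample, width=448, height=448)
for item in result:
    print(item["label"], item["box"])
```

The width and height arguments should be the size of the original image rather than the 448x448 model input, so that the boxes line up with the picture you display.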

Suggested Cloud GPUs

If you do not have a suitable local GPU, consider a GPU instance from a cloud provider such as Google Cloud Platform or AWS EC2.
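When choosing a GPU, a rough estimate of the memory needed for the weights is a useful starting point. The sketch below assumes about 3 billion parameters (read from the "3b" in the model name; the exact count is not given here) and 2 bytes per parameter for bfloat16. It covers the weights only, so activations and the generation cache need additional headroom. The function name weight_memory_gib is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: roughly 3e9 parameters, taken from the "3b" in the model name.
bf16_gib = weight_memory_gib(3e9, bytes_per_param=2)  # bfloat16
fp32_gib = weight_memory_gib(3e9, bytes_per_param=4)  # float32, for comparison
print(round(bf16_gib, 1), round(fp32_gib, 1))  # 5.6 11.2
```

By this estimate the bfloat16 weights take a little under 6 GiB, so a GPU with noticeably more memory than that is a reasonable target.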

License

PaliGemma 2 is available under the Gemma license. Users must review and agree to Google's usage terms before accessing the model.
