PaliGemma 2 3B PT 448
Introduction
PaliGemma 2 is an advanced vision-language model (VLM) that takes images and text as input and generates text as output. It builds on the original PaliGemma model and incorporates the Gemma 2 family of language models, and it is designed to excel at tasks such as image and video captioning, visual question answering, and object detection.
Architecture
PaliGemma 2 combines a Transformer decoder and a Vision Transformer image encoder. The text decoder is initialized from Gemma 2, while the image encoder uses SigLIP-So400m/14. It is configured for fine-tuning with input images of 448x448 pixels and text sequences of up to 512 tokens. The model supports bfloat16 format for efficient computation.
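As a quick sanity check of these settings, the model configuration can be inspected without downloading the full weights. This is a minimal sketch, assuming transformers is installed and that you have accepted the Gemma license on Hugging Face, since the repository is gated:

from transformers import AutoConfig

# Inspect the architecture described above; only the config file is fetched
config = AutoConfig.from_pretrained("google/paligemma2-3b-pt-448")
print(config.vision_config.model_type)  # siglip_vision_model (SigLIP image encoder)
print(config.vision_config.image_size)  # 448
print(config.vision_config.patch_size)  # 14
print(config.text_config.model_type)    # gemma2 (Gemma 2 text decoder)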
Training
PaliGemma 2 is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT, with responsible data filtering applied to promote safety and compliance with ethical guidelines. Training runs on TPUv5e hardware using JAX, Flax, and TFDS.
Guide: Running Locally
To run PaliGemma 2 locally, follow these steps:
- Install Dependencies: Ensure you have Python and the transformers library installed.
- Load Model: Use the PaliGemmaForConditionalGeneration and PaliGemmaProcessor classes to load the model and processor.
- Prepare Inputs: Use load_image to obtain your image and prepare text inputs.
- Generate Output: Process the inputs, generate text with the model, and decode the output.

The snippet below ties these steps together:
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-3b-pt-448"

# load_image accepts a URL or a local file path
image = load_image("your_image_url")

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Pre-trained ("pt") checkpoints are typically prompted with an empty string
prompt = ""
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

# Strip the prompt tokens so only the newly generated text is decoded
decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
print(decoded)
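Because this is a pre-trained checkpoint rather than a fine-tuned one, an empty prompt yields a plain caption, and the pt models are primarily intended as a base for further fine-tuning. As an illustration of the task-prefix convention used across the PaliGemma family (an assumption here; results on the raw pt checkpoint may vary before fine-tuning), prompts look like:

# Illustrative task prefixes from the PaliGemma family; the raw pt
# checkpoint may follow these only loosely before fine-tuning
prompt = "caption en"                        # short English caption
prompt = "answer en What is in the image?"   # visual question answering
prompt = "detect car"                        # object detection (boxes as <loc> tokens)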
Suggested Cloud GPUs
For improved performance, consider using cloud services offering GPUs such as Google Cloud Platform or AWS EC2 instances with GPU support.
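If GPU memory is tight, quantized loading is an alternative to a larger instance. The sketch below is a minimal example, assuming the optional bitsandbytes package is installed (it is not part of the steps above):

from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
import torch

# Load the weights in 4-bit to reduce memory use (requires bitsandbytes)
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-448",
    quantization_config=quant_config,
    device_map="auto",
).eval()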
License
PaliGemma 2 is available under the Gemma license. Users must review and agree to Google's usage terms before accessing the model.