google/paligemma2-28b-pt-896
Introduction
PaliGemma 2 is a vision-language model (VLM) that combines the capabilities of the PaliGemma and Gemma 2 models. It is designed to be fine-tuned on a variety of vision-language tasks such as image captioning, visual question answering, object detection, and segmentation, and it supports multilingual inputs and outputs, making it versatile across domains and languages.
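These tasks are selected through short task prefixes in the text prompt. As a sketch, the prefixes below follow the conventions documented for the original PaliGemma and are assumed to carry over to PaliGemma 2; the question and object names are illustrative placeholders.

```python
# Illustrative PaliGemma-style task prefixes (assumed to apply to PaliGemma 2).
# The prompt is passed alongside the image when calling the processor.
task_prompts = {
    "captioning": "caption en",                                   # short English caption
    "visual question answering": "answer en What color is the car?",
    "object detection": "detect car",                             # emits <loc> location tokens
    "segmentation": "segment car",                                # emits <seg> segmentation tokens
}
```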
Architecture
PaliGemma 2 couples a Transformer text decoder with a Vision Transformer image encoder. The decoder is initialized from Gemma 2 at various parameter sizes (2B, 9B, and 27B), while the image encoder is initialized from SigLIP-So400m/14, yielding combined models of roughly 3B, 10B, and 28B parameters; this checkpoint pairs the 27B decoder with 896×896-pixel image inputs. The model processes both image and text inputs to generate text outputs, supporting tasks like caption generation and object detection.
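A minimal sketch of how this two-tower structure surfaces in the Hugging Face `transformers` config (assuming you have accepted the license and have hub access; field names follow `transformers`' PaliGemmaConfig):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/paligemma2-28b-pt-896")
print(config.vision_config.model_type)  # SigLIP vision encoder
print(config.text_config.model_type)    # Gemma 2 text decoder
print(config.vision_config.image_size)  # 896-pixel input resolution for this checkpoint
```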
Training
PaliGemma 2 is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT. The training data is filtered for safety and appropriateness, removing pornographic, toxic, or otherwise sensitive content. Training uses the latest generation of TPU hardware and software frameworks such as JAX, Flax, and TFDS.
Guide: Running Locally
- Setup Environment: Install the `transformers` library and required dependencies using `pip install transformers`.
- Load Model: Use the code snippet below to load and run the model.
```python
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-28b-pt-896"

# Load the model in bfloat16 and shard it across available devices.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = load_image(url)

# Pre-trained (pt) checkpoints are queried with a blank or task-prefix prompt.
prompt = ""
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(
    generation[0][model_inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(decoded)
```
- Fine-Tuning: Consider fine-tuning the model on specific tasks for improved performance; see the LoRA sketch after this list.
- Cloud GPUs: For efficient computation with a model of this size, consider cloud GPU services such as AWS, Google Cloud, or Azure; quantized loading (also sketched below) can reduce memory requirements.
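A minimal fine-tuning sketch using LoRA via the `peft` library. This is one reasonable recipe, not a method prescribed by the model card; the target module names and the `suffix` argument (which the processor turns into training labels) follow common `transformers`/`peft` usage for PaliGemma-style models.

```python
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image
from peft import LoraConfig, get_peft_model

model_id = "google/paligemma2-28b-pt-896"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the decoder's attention projections with low-rank adapters.
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# One illustrative training step on a single (image, prompt, target) example.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = load_image(url)
inputs = processor(
    text="caption en", images=image,
    suffix="A red car parked on the street.",  # hypothetical target; `suffix` becomes the label sequence
    return_tensors="pt",
).to(torch.bfloat16).to(model.device)
loss = model(**inputs).loss
loss.backward()
```

If a single GPU cannot hold the 28B model in bfloat16, quantized loading is one option; a sketch assuming the `bitsandbytes` package is installed:

```python
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Load the weights in 4-bit, computing in bfloat16.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-28b-pt-896",
    quantization_config=bnb_config,
    device_map="auto",
)
```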
License
The PaliGemma 2 model is distributed under the "gemma" license. Users must review and agree to Google's usage terms before accessing the model.