paligemma2-3b-pt-224
google
Introduction
PaliGemma 2 is an advanced vision-language model (VLM) developed by Google, built upon the capabilities of the Gemma 2 models. This model takes both image and text as input and generates text as output, supporting multiple languages. It is designed for high performance in vision-language tasks such as image captioning, visual question answering, text reading, object detection, and object segmentation.
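The tasks listed above are usually selected through a short text prefix in the prompt. The sketch below builds such prompts; the prefix strings ("caption", "ocr", "answer", "detect", "segment") follow the conventions documented for the PaliGemma family and are an assumption here, since this page does not list them. The helper name build_prompt and its task keys are hypothetical, and a pre-trained (pt) checkpoint may need fine-tuning before it responds reliably to every prefix.

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Return a task-prefixed prompt string (hypothetical helper).

    The prefixes are assumed from the PaliGemma family's documented
    conventions; they are not stated on this page.
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "ocr":
        return "ocr"
    if task == "vqa":
        # `text` is the question to answer about the image.
        return f"answer {lang} {text}"
    if task == "detect":
        # `text` names the object(s), e.g. "cat" or "cat ; dog".
        return f"detect {text}"
    if task == "segment":
        return f"segment {text}"
    raise ValueError(f"unknown task: {task!r}")


print(build_prompt("caption"))
print(build_prompt("vqa", "where is the cat?"))
print(build_prompt("detect", "cat ; dog"))
```

The returned string would be passed as the text prompt alongside the image.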
Architecture
PaliGemma 2 integrates a Transformer decoder and a Vision Transformer image encoder. The text decoder is based on Gemma 2, available in 2B, 9B, and 27B parameter sizes, while the image encoder is initialized from SigLIP-So400m/14. This model follows the PaLI-3 training recipes, allowing it to effectively process and generate text from image and text inputs.
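To make the encoder/decoder hand-off concrete, the sketch below estimates how many image tokens the decoder receives. It assumes the "/14" in SigLIP-So400m/14 is the patch size in pixels, that the "224" in the model name is the square input resolution, and that each patch becomes one token with no pooling. The function name num_image_tokens is hypothetical.

```python
def num_image_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image.

    Assumes one token per patch (no pooling), which is an assumption
    of this sketch rather than something stated on this page.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# 224 / 14 = 16 patches per side, so 16 * 16 = 256 image tokens.
print(num_image_tokens(224))
```

Under these assumptions, the image tokens are placed before the text prompt tokens, so the decoder's input length is roughly the image token count plus the prompt length.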
Training
PaliGemma 2 is pre-trained on a mixture of datasets including WebLI, CC3M-35L, and OpenImages, using Google's Tensor Processing Unit (TPU) hardware. The training stack consists of JAX, Flax, and TFDS, with TFDS used for dataset access. Data responsibility filtering is applied to the training data to remove low-quality and unsafe content.
Guide: Running Locally
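- Install Required Libraries
  Ensure you have Python installed, then install the transformers library using pip. The snippets in the following steps also import torch, load images through Pillow, and pass device_map="auto", which requires the accelerate package:

      pip install transformers torch accelerate pillow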
- Load the Model and Processor
  Use the following code snippet to load the model and processor:

      from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
      import torch

      model_id = "google/paligemma2-3b-pt-224"
      model = PaliGemmaForConditionalGeneration.from_pretrained(
          model_id, torch_dtype=torch.bfloat16, device_map="auto"
      ).eval()
      processor = PaliGemmaProcessor.from_pretrained(model_id)
- Prepare Inputs and Generate Text
  Load your image and prepare inputs for the model:

      from transformers.image_utils import load_image

      url = "https://example.com/your-image.jpg"
      image = load_image(url)
      prompt = ""  # Your text prompt here
      model_inputs = (
          processor(text=prompt, images=image, return_tensors="pt")
          .to(torch.bfloat16)
          .to(model.device)
      )
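- Run Inference
  Generate text using the model:

      # generate() returns the prompt tokens followed by the new tokens,
      # so record the prompt length and slice it off before decoding.
      input_len = model_inputs["input_ids"].shape[-1]

      with torch.inference_mode():
          generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

      decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
      print(decoded)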
- Cloud GPU Recommendation
  For faster inference, especially with the larger model variants, consider using a cloud GPU service such as AWS, Google Cloud, or Azure.
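When deciding between a local GPU and a cloud instance, a rough estimate of the memory needed for the weights alone can help. The sketch below is back-of-the-envelope arithmetic: it takes the nominal 3 billion parameters from the "3b" in the model name (an approximation, not an exact count) and the bfloat16 dtype used in the loading step, and it ignores activations, the KV cache, and framework overhead. The function name weight_memory_gib is hypothetical.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal parameter count taken from the "3b" in the model name.
nominal_params = 3e9
print(f"bfloat16: {weight_memory_gib(nominal_params):.1f} GiB")            # 5.6 GiB
print(f"float32:  {weight_memory_gib(nominal_params, 'float32'):.1f} GiB")  # 11.2 GiB
```

Actual usage during generation will be higher than these figures, so treat them as a lower bound when choosing hardware.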
License
The PaliGemma 2 model is released under the Gemma license. Users must review and agree to Google's usage license to access the model on Hugging Face.