google/paligemma2-3b-pt-896
PaliGemma 2 Model Documentation
Introduction
PaliGemma 2 is an advanced vision-language model (VLM) developed by Google. It builds on the Gemma 2 language models, taking both image and text inputs and generating text outputs. The model is suited to tasks such as image captioning, visual question answering, and object detection, among others, and supports multiple languages.
Architecture
PaliGemma 2 combines a Transformer decoder with a Vision Transformer image encoder. The text decoder is derived from Gemma 2, and the image encoder is based on the SigLIP vision model. It uses a bfloat16 format for its weights, allowing for efficient fine-tuning.
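As a rough illustration of this composition, the checkpoint's configuration can be inspected with the transformers library. This is a minimal sketch, not an official example: the field names follow the transformers PaliGemma config layout, and the commented values are expectations for this variant rather than guaranteed output.

```python
# Minimal sketch: inspect how the PaliGemma 2 checkpoint is composed.
# Assumes transformers is installed and that you have accepted the Gemma license
# and authenticated with Hugging Face (the checkpoint is gated).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/paligemma2-3b-pt-896")

print(config.model_type)                # top-level PaliGemma config type
print(config.vision_config.model_type)  # SigLIP-based image encoder
print(config.text_config.model_type)    # Gemma 2-based text decoder
print(config.torch_dtype)               # weight dtype, typically bfloat16
```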
Training
PaliGemma 2 is trained on a diverse mixture of datasets, including WebLI, CC3M-35L, and OpenImages, among others. The training data is filtered for unsafe content using a range of safety mechanisms. Training runs on TPUv5e hardware and uses JAX, Flax, and TFDS.
Guide: Running Locally
- Set Up Environment: Install the necessary Python packages, including transformers and torch.
- Load Model: Use PaliGemmaForConditionalGeneration and PaliGemmaProcessor from the transformers library to load the pre-trained model.
- Run Inference: Prepare the inputs (an image and a text prompt) and run the model to generate output text; a minimal sketch follows this list.
- Hardware Recommendations: Use cloud GPUs, such as those offered by Google Cloud Platform, for optimal performance.
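The following is a minimal inference sketch, not an official example: the image URL is a placeholder, and the prompt format ("caption en") and dtype handling follow common usage of the transformers PaliGemma API and may need adjusting for your transformers version. Note that pretrained (pt) checkpoints such as this one are primarily intended as a base for fine-tuning and respond best to task-style prompts.

```python
# Minimal inference sketch for google/paligemma2-3b-pt-896 (assumptions noted below).
# Requires: pip install torch transformers pillow requests
# The checkpoint is gated, so accept the Gemma license on Hugging Face and log in first.
import torch
import requests
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-pt-896"

# Processor handles image resizing/normalization and text tokenization.
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Load the bfloat16 weights and place the model on the available device(s).
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

# Placeholder image URL; replace with your own image path or URL.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Pretrained (pt) checkpoints expect task-style prompts, e.g. "caption en" or "detect cat".
prompt = "caption en"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(generated[0][input_len:], skip_special_tokens=True))
```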
License
PaliGemma 2 is released under the Gemma license. Access to the model requires reviewing and agreeing to Google's usage terms. Ensure compliance with the associated ethical guidelines and usage policies.