paligemma2-10b-pt-448
PaliGemma 2 Model Card (Google)
Introduction
PaliGemma 2 is an advanced vision-language model (VLM) developed by Google that combines the Gemma 2 language models with the SigLIP vision model. It accepts image and text inputs, produces text outputs, and supports multiple languages. Its primary applications include image and video captioning, visual question answering, object detection, and segmentation. This checkpoint, paligemma2-10b-pt-448, is the 10B-parameter pretrained ("pt") variant operating at a 448x448 input resolution.
Architecture
PaliGemma 2 combines a Transformer decoder derived from the Gemma 2 language models with a Vision Transformer image encoder based on SigLIP (SigLIP-So400m/14). The text decoder comes in 2B, 9B, and 27B parameter configurations, which together with the roughly 400M-parameter vision encoder yield combined models of about 3B, 10B, and 28B parameters. The model is trained following the recipe outlined for PaLI-3.
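As a quick illustration, the two components are visible in the checkpoint's configuration via the `transformers` library. This is a minimal sketch, assuming Hub access and an accepted license; the values noted in the comments are what the configuration is expected to report.

```python
from transformers import PaliGemmaConfig

# Inspect the composite configuration of this checkpoint.
config = PaliGemmaConfig.from_pretrained("google/paligemma2-10b-pt-448")

print(config.vision_config.model_type)  # expected: "siglip_vision_model" (SigLIP ViT encoder)
print(config.text_config.model_type)    # expected: "gemma2" (Gemma 2 decoder)
print(config.vision_config.image_size)  # expected: 448, the input resolution of this variant
```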
Training
PaliGemma 2 was pre-trained on a diverse mixture of datasets including WebLI, CC3M-35L, and OpenImages, among others. The training data was filtered for safe and responsible use, with filters for pornographic content, text safety, and text toxicity. Training used JAX, Flax, TFDS, and the big_vision library on TPUv5e hardware.
Guide: Running Locally
To run PaliGemma 2 locally, follow these steps:
- Install Dependencies: Ensure you have Python installed along with the `transformers` library.
- Load the Model: Use the `PaliGemmaProcessor` and `PaliGemmaForConditionalGeneration` classes from the `transformers` library.
- Prepare Inputs: Load your image using `load_image` and prepare your text input.
- Execute: Run the model for text generation, ensuring the model is set to evaluation mode (see the sketch after this list).
- Cloud GPUs: For optimal performance, especially with larger model sizes, consider using cloud-based GPUs from providers like AWS, GCP, or Azure.
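The steps above correspond to the following minimal sketch using `transformers`. The image URL and the empty prompt are illustrative assumptions (pretrained "pt" checkpoints are typically queried with a blank prompt or a short task prefix such as "caption en"), and downloading the weights requires accepting the Gemma license on the Hugging Face Hub.

```python
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image

model_id = "google/paligemma2-10b-pt-448"

# Load model and processor; bfloat16 and device_map="auto" keep memory usage manageable.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Example image URL (placeholder -- substitute your own image).
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
)

# Pretrained ("pt") checkpoints are usually prompted with an empty string or a task prefix.
prompt = ""
inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated text.
    print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```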
License
PaliGemma 2 is released under the Gemma license, which requires users to agree to Google's terms of use before accessing the model.