paligemma2-10b-pt-448

google

PaliGemma 2 Model Card

Introduction

PaliGemma 2 is an advanced vision-language model (VLM) developed by Google, integrating the functionalities of both the Gemma 2 language models and the SigLIP vision model. This model is designed to handle image and text inputs, producing textual outputs and supporting multiple languages. Its primary applications include image and video captioning, visual question answering, object detection, and segmentation.

Architecture

PaliGemma 2 combines a Transformer decoder, derived from the Gemma 2 language model, with a Vision Transformer image encoder based on SigLIP. The text decoder is available in configurations of 2B, 9B, and 27B parameters, while the image encoder is adapted from SigLIP-So400m/14. The model is trained using techniques outlined in the PaLI-3 framework.
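As a rough illustration of this two-part layout, the checkpoint's configuration exposes the SigLIP vision encoder and the Gemma 2 text decoder as separate sub-configs. The snippet below is a minimal sketch assuming the transformers library and the gated google/paligemma2-10b-pt-448 repository; the commented values are expectations, not guarantees.

```python
from transformers import AutoConfig

# Inspect the composite architecture of a PaliGemma 2 checkpoint
# (gated repository; license acceptance on the Hub is required).
config = AutoConfig.from_pretrained("google/paligemma2-10b-pt-448")

print(config.model_type)                # expected: "paligemma"
print(config.vision_config.model_type)  # expected: SigLIP vision encoder sub-config
print(config.text_config.model_type)    # expected: Gemma 2 decoder sub-config
print(config.vision_config.image_size)  # expected: 448 for this checkpoint
```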

Training

PaliGemma 2 was pre-trained on a diverse set of datasets including WebLI, CC3M-35L, and OpenImages, among others. The model underwent ethical data filtering to ensure safe and responsible use, applying filters for pornographic content, text safety, and text toxicity. Training involved the use of JAX, Flax, TFDS, and the big_vision library on TPUv5e hardware.

Guide: Running Locally

To run PaliGemma 2 locally, follow these steps:

  1. Install Dependencies: Ensure you have Python installed along with the transformers library (and torch).
  2. Load the Model: Use the PaliGemmaProcessor and PaliGemmaForConditionalGeneration classes from the transformers library.
  3. Prepare Inputs: Load your image using load_image and prepare your text prompt.
  4. Execute: Run the model for text generation with the model set to evaluation mode; a minimal sketch follows this list.
  5. Cloud GPUs: For optimal performance, especially with the larger model sizes, consider using cloud-based GPUs from providers such as AWS, GCP, or Azure.
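The following sketch covers steps 1 through 4, assuming the gated google/paligemma2-10b-pt-448 checkpoint on the Hugging Face Hub (access requires accepting the Gemma license), a recent transformers release with PaliGemma support, and an example image URL chosen purely for illustration; prompting conventions for pre-trained checkpoints may differ by task.

```python
# pip install -U transformers accelerate torch pillow
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
from transformers.image_utils import load_image

model_id = "google/paligemma2-10b-pt-448"  # gated; requires accepting the Gemma license

# Load the processor and model; bfloat16 and device_map="auto" keep memory usage manageable.
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()  # evaluation mode

# Illustrative image URL; replace with your own image path or URL.
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
)

# Pre-trained (pt) checkpoints are typically driven with short task prompts
# such as "caption en"; the exact prompt here is an assumption.
prompt = "caption en"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    # Strip the prompt tokens before decoding the generated text.
    print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```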

License

PaliGemma 2 is released under the Gemma license, which requires users to agree to Google's terms of use before accessing the model.
