paligemma2-3b-pt-896

google

PaliGemma 2 Model Documentation

Introduction

PaliGemma 2 is a vision-language model (VLM) developed by Google. It updates the original PaliGemma model by incorporating the capabilities of the Gemma 2 language models. It takes both images and text as input and generates text as output. The model is suited to tasks such as image captioning, visual question answering, text reading, and object detection, and it supports multiple languages.

Architecture

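PaliGemma 2 combines a Transformer decoder with a Vision Transformer image encoder. The text decoder is derived from Gemma 2, and the image encoder is based on the SigLIP vision model. The weights are distributed in bfloat16 format, which takes half the memory of float32 and is the format intended for fine-tuning. This checkpoint is the 3B pre-trained (pt) variant, which takes 896x896-pixel input images.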
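
To make the 896 in the checkpoint name concrete, the following sketch estimates how many image tokens the encoder passes to the decoder and roughly how much memory the bfloat16 weights occupy. It assumes a patch size of 14 pixels (the SigLIP So400m/14 configuration) and a nominal 3 billion parameters read off the "3b" in the name. Neither figure is stated in this document, and the function names are illustrative only.

```python
def num_image_tokens(image_size: int, patch_size: int = 14) -> int:
    """Count the patch tokens a ViT-style encoder emits for a square image.

    patch_size=14 is an assumption (SigLIP So400m/14), not taken from the text.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GiB; bfloat16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    for size in (224, 448, 896):
        print(f"{size}x{size} px -> {num_image_tokens(size)} image tokens")
    # 3e9 is a nominal figure inferred from "3b"; the exact count differs.
    print(f"bfloat16 weights: ~{weight_memory_gib(3e9):.1f} GiB")
    print(f"float32 weights:  ~{weight_memory_gib(3e9, 4):.1f} GiB")
```

Under these assumptions an 896x896 image becomes a 64 by 64 grid of patches, so the decoder sees 4096 image tokens before any text. That is why the 896 variant costs noticeably more compute per image than lower-resolution variants.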

Training

PaliGemma 2 is trained on a mixture of datasets, including WebLI, CC3M-35L, and OpenImages, among others. The training data was filtered to remove unsafe content, such as pornographic images and unsafe or toxic text, and to limit personal information. Training ran on TPUv5e hardware using JAX, Flax, and TFDS.

Guide: Running Locally

  1. Set Up Environment: Install the required Python packages, including transformers and torch.
  2. Load Model: Use PaliGemmaForConditionalGeneration and PaliGemmaProcessor from the transformers library to load the pre-trained checkpoint (google/paligemma2-3b-pt-896).
  3. Run Inference: Prepare the inputs (an image and a text prompt) with the processor, then call the model's generate method to produce the output text.
  4. Hardware Recommendations: Run the model on a GPU. Cloud GPUs, such as those offered by Google Cloud Platform, are an option if no suitable local accelerator is available.
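
The steps above can be sketched as a small script. This is a minimal example under stated assumptions, not the official recipe: it assumes the transformers, torch, pillow, and accelerate packages are installed, that you have accepted the license and authenticated with Hugging Face, and that the default prompt "caption en" suits your task. The helper name caption_image and the file names in the usage line are hypothetical.

```python
import sys

MODEL_ID = "google/paligemma2-3b-pt-896"


def caption_image(image_path: str, prompt: str = "caption en",
                  max_new_tokens: int = 100) -> str:
    """Generate text for one image with PaliGemma 2 (illustrative helper)."""
    # Heavy dependencies are imported lazily so importing this file stays cheap.
    import torch
    from PIL import Image
    from transformers import (
        PaliGemmaForConditionalGeneration,
        PaliGemmaProcessor,
    )

    # Step 2: load the processor and the bfloat16 weights.
    # device_map="auto" relies on the accelerate package (an assumption here).
    processor = PaliGemmaProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()

    # Step 3: turn the image and prompt into model inputs.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(torch.bfloat16).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]

    # Greedy decoding; no gradients are needed for inference.
    with torch.inference_mode():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Keep only the newly generated tokens, dropping the echoed prompt.
    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python caption.py IMAGE_PATH [PROMPT]")
    else:
        print(caption_image(sys.argv[1], *sys.argv[2:3]))
```

For example, `python caption.py photo.jpg "caption en"` would print the generated text for photo.jpg. Because this is a pre-trained (pt) checkpoint meant as a starting point for fine-tuning, raw outputs may be short or task-dependent.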

License

PaliGemma 2 is released under the Gemma license. To access the model, you must first review and agree to Google's usage terms. Any use of the model must comply with those terms and the associated usage policies.
