PaliGemma 2 3B ft DOCCI 448 (JAX)

Google

Introduction

PaliGemma 2 is an advanced vision-language model (VLM) designed to handle image and text inputs and generate text outputs across various languages. It enhances the capabilities of the previous PaliGemma model by incorporating features from the Gemma 2 models, allowing for superior performance in tasks like image captioning, visual question answering, and object detection.

Architecture

PaliGemma 2 pairs a Transformer text decoder with a Vision Transformer image encoder. The text decoder is initialized from the Gemma 2 models, and the image encoder is built on SigLIP-So400m/14. Together, they let the model condition text generation on both image and text inputs.
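
The sketch below is a minimal, hypothetical illustration of this data flow, not the big_vision implementation: the dimensions, vocabulary size, and random projections are placeholders standing in for the SigLIP encoder and the Gemma 2 embedding table. The only point it demonstrates is that image patches become "soft" tokens that are prefixed to the embedded text prompt before decoding.

    # Illustrative only: placeholder stand-ins for the SigLIP-So400m/14 encoder
    # and the Gemma 2 decoder embeddings; real shapes and weights come from
    # the big_vision codebase.
    import jax
    import jax.numpy as jnp

    EMBED_DIM = 256           # placeholder width; the real decoder width differs
    NUM_IMAGE_TOKENS = 1024   # (448 / 14) ** 2 patches at this checkpoint's resolution

    def encode_image(image):
        # Split the 448x448x3 image into 32x32 non-overlapping 14x14 patches and
        # project each one linearly (stand-in for the SigLIP ViT encoder).
        p = 14
        patches = image.reshape(32, p, 32, p, 3).transpose(0, 2, 1, 3, 4)
        patches = patches.reshape(NUM_IMAGE_TOKENS, -1)
        proj = jax.random.normal(jax.random.PRNGKey(0), (patches.shape[-1], EMBED_DIM)) * 0.02
        return patches @ proj   # "soft" image tokens

    def embed_text(token_ids, vocab_size=1000):
        # Stand-in for the Gemma 2 token embedding table (tiny placeholder vocab).
        table = jax.random.normal(jax.random.PRNGKey(1), (vocab_size, EMBED_DIM)) * 0.02
        return table[token_ids]

    def decoder_inputs(image, prompt_ids):
        # Image tokens are prefixed to the embedded text prompt; the decoder
        # then generates the output text autoregressively from this sequence.
        return jnp.concatenate([encode_image(image), embed_text(prompt_ids)], axis=0)

    image = jnp.zeros((448, 448, 3))               # input resolution of this checkpoint
    prompt_ids = jnp.array([2, 10, 42, 7])         # dummy token ids
    print(decoder_inputs(image, prompt_ids).shape) # (1024 + 4, EMBED_DIM)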

Training

PaliGemma 2 is trained using JAX, Flax, and TFDS on TPUs, leveraging datasets such as WebLI, CC3M-35L, and OpenImages. Filtering is applied to the training data for responsibility, removing inappropriate content and personal information. This checkpoint is fine-tuned on the DOCCI dataset with 448x448 input images, and the model supports transfer to a wide range of vision-language tasks.
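
To make the 448x448 input requirement concrete, here is a small, assumed preprocessing sketch. The rescaling to [-1, 1] follows a common SigLIP-style convention and is not confirmed for this checkpoint; consult the big_vision preprocessing ops for the exact pipeline.

    # Assumed preprocessing sketch: resize to 448x448 and rescale pixel values.
    import numpy as np
    import jax.numpy as jnp
    from PIL import Image

    def preprocess(path):
        img = Image.open(path).convert("RGB").resize((448, 448))
        arr = jnp.asarray(np.asarray(img, dtype=np.float32)) / 255.0  # scale to [0, 1]
        return arr * 2.0 - 1.0                                        # shift to [-1, 1] (assumed)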

Guide: Running Locally

  1. Authenticate using the Hugging Face CLI:
    huggingface-cli login
    
  2. Download the model weights:
    huggingface-cli download --local-dir models google/paligemma2-3b-ft-docci-448-jax
    
    This will save the weights to the models directory.
  3. Set up the environment: Ensure you have JAX and Flax installed (a programmatic sketch covering steps 2 and 3 follows this list).
  4. Run the model using the big_vision codebase for inference and fine-tuning.
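
The snippet below is a hedged, programmatic alternative to steps 2 and 3: it downloads the weights with the huggingface_hub Python API (reusing the credentials from step 1) and checks which devices JAX can see. It does not run inference, which still goes through big_vision.

    # Programmatic sketch of steps 2 and 3.
    import jax
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="google/paligemma2-3b-ft-docci-448-jax",
        local_dir="models",   # same target directory as the CLI command above
    )
    print("Weights saved to:", local_dir)
    print("JAX devices:", jax.devices())  # lists the available CPU, GPU, or TPU devices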

For optimal performance, consider using cloud accelerators such as TPUs or GPUs, for example those available on Google Cloud or AWS.

License

The PaliGemma 2 model is distributed under the Gemma license. Users must review and agree to Google's usage license before accessing the model through Hugging Face.
