paligemma2 10b pt 896
google
Introduction
PaliGemma 2 is a vision-language model (VLM) developed by Google that takes image and text inputs and generates text outputs. It builds on the Gemma 2 language models and is inspired by the PaLI-3 model. It is suited to tasks such as image captioning, visual question answering, and object detection, and it supports multiple languages.
Architecture
PaliGemma 2 integrates a Transformer decoder and a Vision Transformer image encoder. The text decoder derives from Gemma 2, while the image encoder is initialized from the SigLIP model. It processes both images and text, producing text outputs, such as captions or answers to questions.
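To make the image-encoder side concrete, the sketch below estimates how many image tokens a Vision Transformer encoder hands to the text decoder. The 896-pixel input size is read off the model name; the patch size of 14 is an assumption about the SigLIP encoder that the text above does not state, and `num_image_tokens` is a hypothetical helper, not part of any library.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Count the patches (one token each) in a square image.

    patch_size=14 is an assumed value for the SigLIP encoder; adjust it
    if the actual checkpoint uses a different patch size.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


# 896 / 14 = 64 patches per side, so 64 * 64 tokens per image.
print(num_image_tokens(896))
```

Under these assumptions, a higher input resolution means a longer token sequence for the decoder, which is one reason larger-resolution variants cost more to run.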
Training
The model was pre-trained on a diverse mixture of datasets, including WebLI, CC3M-35L, and OpenImages, among others. Training used TPU hardware with the JAX and Flax software frameworks. The data was filtered for safety and responsibility, with filters for pornographic content, unsafe text, and personal information.
Guide: Running Locally
To run PaliGemma 2 locally, you can use the `transformers` library in Python. Here are the basic steps:
- Install Dependencies: Ensure you have the `transformers` library and other dependencies installed.
- Load the Model: Use `PaliGemmaForConditionalGeneration` to load the pre-trained model.
- Prepare Input: Load an image and prepare text input (e.g., prompts) using `PaliGemmaProcessor`.
- Generate Output: Use the model to generate text based on the input.
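The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the checkpoint ID `google/paligemma2-10b-pt-896` is inferred from the title of this page, `generate_text` is a hypothetical wrapper, and the code assumes `transformers`, `torch`, `accelerate` (for `device_map="auto"`), and `Pillow` are installed and that the license has already been accepted on Hugging Face.

```python
# Assumed checkpoint ID, inferred from the page title.
MODEL_ID = "google/paligemma2-10b-pt-896"


def generate_text(image_path, prompt="", model_id=MODEL_ID, max_new_tokens=100):
    """Load the model, run it on one image plus prompt, and return the text."""
    # Heavy imports live inside the function so defining it stays cheap.
    import torch
    from PIL import Image
    from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

    # Load the Model: weights in bfloat16, placed automatically on available devices.
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    processor = PaliGemmaProcessor.from_pretrained(model_id)

    # Prepare Input: the processor combines the image and the text prompt.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(torch.bfloat16).to(model.device)
    input_len = inputs["input_ids"].shape[-1]

    # Generate Output: greedy decoding, then drop the echoed input tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return processor.decode(output_ids[0][input_len:], skip_special_tokens=True)
```

A call such as `generate_text("photo.jpg")` (with a hypothetical local image path) would download the weights on first use, so expect a long initial run and substantial memory use.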
For efficient processing, consider using cloud accelerators, such as GPU-backed AWS EC2 instances or Google Cloud TPUs, especially for large-scale tasks.
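As a rough guide to sizing such hardware, the sketch below estimates the memory needed just to hold the model weights. It assumes about 10 billion parameters (read off the "10b" in the model name) and ignores activations, the image tokens, and generation caches, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


ASSUMED_PARAMS = 10e9  # assumption based on the "10b" in the model name

# bfloat16 stores each parameter in 2 bytes, float32 in 4 bytes.
print(f"bfloat16: {weight_memory_gib(ASSUMED_PARAMS, 2):.1f} GiB")
print(f"float32:  {weight_memory_gib(ASSUMED_PARAMS, 4):.1f} GiB")
```

Under these assumptions, the weights alone exceed the memory of many consumer GPUs, which is why a cloud accelerator is often the practical choice.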
License
PaliGemma 2 is distributed under the Gemma license. Users must comply with Google's usage terms and acknowledge the license on Hugging Face to access the model.