PaliGemma 2 3B FT DOCCI 448 (JAX)
Introduction
PaliGemma 2 is an advanced vision-language model (VLM) designed to handle image and text inputs and generate text outputs across various languages. It enhances the capabilities of the previous PaliGemma model by incorporating features from the Gemma 2 models, allowing for superior performance in tasks like image captioning, visual question answering, and object detection.
Architecture
PaliGemma 2 combines a Transformer decoder and a Vision Transformer image encoder. The text decoder is initialized from the Gemma 2 models, and the image encoder is built on SigLIP-So400m/14. This architecture facilitates the model's ability to process and generate text from both image and text inputs.
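To make the combined image-and-text decoding flow concrete, here is a minimal, illustrative sketch in JAX of how image tokens from a SigLIP-style encoder can be prefixed to text-token embeddings before the Gemma 2 decoder runs. The dimensions and the prepare_decoder_inputs helper are hypothetical placeholders for illustration, not the actual big_vision implementation.

import jax
import jax.numpy as jnp

# Hypothetical sizes for illustration only (not the real model configuration).
NUM_IMAGE_TOKENS = 1024  # (448 / 14)^2 patches from a ViT-style image encoder
HIDDEN_DIM = 2048        # placeholder decoder embedding width

def prepare_decoder_inputs(image_tokens, text_embeddings):
    # Prefix the image tokens to the text-token embeddings so the decoder
    # attends over both modalities as a single sequence.
    # image_tokens:    [batch, num_image_tokens, hidden_dim]
    # text_embeddings: [batch, text_length, hidden_dim]
    return jnp.concatenate([image_tokens, text_embeddings], axis=1)

# Dummy activations standing in for the SigLIP encoder and Gemma 2 embedder outputs.
key = jax.random.PRNGKey(0)
image_tokens = jax.random.normal(key, (1, NUM_IMAGE_TOKENS, HIDDEN_DIM))
text_embeddings = jax.random.normal(key, (1, 16, HIDDEN_DIM))

decoder_inputs = prepare_decoder_inputs(image_tokens, text_embeddings)
print(decoder_inputs.shape)  # (1, 1040, 2048)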
Training
PaliGemma 2 is trained using JAX, Flax, and TFDS on TPUs, leveraging datasets like WebLI, CC3M-35L, and OpenImages. It employs filtering techniques to ensure data responsibility, removing inappropriate content and personal information. The model is fine-tuned with 448x448 input images on the DOCCI dataset, and it supports transfer to a wide range of vision-language tasks.
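As a minimal sketch of preparing a 448x448 input image, assuming a standard resize-and-rescale pipeline (the [-1, 1] pixel range is an assumption; check the big_vision preprocessing configs for the exact transform):

import jax.numpy as jnp
import numpy as np
from PIL import Image

def preprocess_image(path, size=448):
    # Resize to size x size and rescale pixel values to [-1, 1] (assumed range).
    image = Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0
    return jnp.asarray(pixels * 2.0 - 1.0)  # shape: (448, 448, 3)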
Guide: Running Locally
- Authenticate using the Hugging Face CLI:
huggingface-cli login
- Download the model weights:
huggingface-cli download --local-dir models google/paligemma2-3b-ft-docci-448-jax
This will save the weights to the models directory (a programmatic alternative is sketched after this guide).
- Set up the environment: ensure you have JAX and Flax installed.
- Run the model using the big_vision codebase for inference and fine-tuning.
For optimal performance, consider using cloud GPUs, such as those available on Google Cloud or AWS.
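If you prefer to fetch the checkpoint programmatically instead of through the CLI, the sketch below uses huggingface_hub; the repository ID and the models directory come from the guide above, and everything else is an illustrative assumption. The downloaded files can then be pointed at the big_vision inference and fine-tuning scripts.

from huggingface_hub import snapshot_download

# Mirrors the CLI command above: downloads the JAX checkpoint files into ./models.
local_dir = snapshot_download(
    repo_id="google/paligemma2-3b-ft-docci-448-jax",
    local_dir="models",
)
print("Checkpoint files saved to:", local_dir)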
License
The PaliGemma 2 model is distributed under the Gemma license. Users must review and agree to Google's usage license before accessing the model through Hugging Face.