paligemma2 3b pt 224

google

Introduction

PaliGemma 2 is a vision-language model (VLM) developed by Google and built on the Gemma 2 language models. It takes an image and text as input and generates text as output, with support for multiple languages. The checkpoint described here, paligemma2-3b-pt-224, is the 3B-parameter pre-trained (pt) variant with 224×224 image input. The model family targets vision-language tasks such as image captioning, visual question answering, text reading, object detection, and object segmentation.

Architecture

PaliGemma 2 combines a Transformer text decoder with a Vision Transformer image encoder. The text decoder is based on Gemma 2, which is available in 2B, 9B, and 27B parameter sizes, and the image encoder is initialized from SigLIP-So400m/14. Training follows the PaLI-3 recipes: the model receives an image together with a text prompt and generates text in response.
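
To make the encoder's role more concrete, the sketch below works out how many image tokens the decoder would receive for a given input size. It assumes that the "/14" in SigLIP-So400m/14 is the patch size in pixels, that the 224 in the checkpoint name is the side length of a square input, and that each patch becomes one token. The function name num_image_tokens is illustrative and not part of any library.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Count the non-overlapping square patches in a square image.

    Assumes one token per patch (an assumption, see the text above).
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2


# Print the token count for a few square input sizes.
for res in (224, 448, 896):
    print(res, num_image_tokens(res))
```

Under these assumptions, a 224×224 input is split into 16 × 16 = 256 patches, so the image contributes 256 tokens to the decoder's input.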

Training

PaliGemma 2 is pre-trained on a mixture of datasets that includes WebLI, CC3M-35L, and OpenImages, among others, using Google's Tensor Processing Unit (TPU) hardware. The training stack uses JAX and Flax, with TFDS (TensorFlow Datasets) for dataset access. Data responsibility filtering is applied to the pre-training data, covering both content quality and safety.

Guide: Running Locally

  1. Install Required Libraries
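    Make sure Python is installed, then use pip to install transformers together with the packages the code below relies on: torch (imported directly), accelerate (required for device_map="auto"), and pillow (used for image loading):

    pip install transformers torch accelerate pillow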
    
  2. Load the Model and Processor
    Use the following code snippet to load the model and processor:

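    import torch
    from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

    model_id = "google/paligemma2-3b-pt-224"
    # Load the weights in bfloat16; device_map="auto" (needs accelerate) picks the device.
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    # The processor handles image preprocessing and text tokenization.
    processor = PaliGemmaProcessor.from_pretrained(model_id)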
    
  3. Prepare Inputs and Generate Text
    Load your image and prepare inputs for the model:

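    from transformers.image_utils import load_image

    # Placeholder URL: replace it with the URL or local path of your image.
    url = "https://example.com/your-image.jpg"
    image = load_image(url)
    # Left empty here; replace with your text prompt.
    prompt = ""
    # Cast floating-point inputs to bfloat16 to match the model, then move them to its device.
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)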
    
  4. Run Inference
    Generate text using the model:

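    # Number of prompt tokens, so they can be dropped from the output.
    input_len = model_inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        # Greedy decoding (do_sample=False), at most 100 new tokens.
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Keep only the newly generated tokens and decode them to text.
    decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
    print(decoded)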
    
  5. Cloud GPU Recommendation
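    In bfloat16, the weights of this 3B checkpoint take roughly 6 GB (about 3 billion parameters at 2 bytes each), so a GPU with more memory than that is advisable. If you do not have one locally, or you plan to run the larger variants, consider a cloud GPU service such as AWS, Google Cloud, or Azure.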
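
Step 3 above leaves the prompt empty. Pre-trained PaliGemma checkpoints are typically prompted with a short task prefix, and the sketch below builds such prompts for the tasks listed in the introduction. The prefix strings ("caption", "answer", "ocr", "detect", "segment") reflect the conventions described in the PaliGemma documentation as understood here; treat them as assumptions and verify them against the model card. TASK_TEMPLATES and build_prompt are hypothetical helper names, not part of transformers.

```python
# Assumed task-prefix templates; verify against the model card before relying on them.
TASK_TEMPLATES = {
    "caption": "caption {lang}",
    "vqa": "answer {lang} {text}",
    "ocr": "ocr",
    "detect": "detect {text}",
    "segment": "segment {text}",
}


def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Return a task-prefixed prompt string for the given task (hypothetical helper)."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    template = TASK_TEMPLATES[task]
    # Tasks whose template contains {text} need a question or an object name.
    if "{text}" in template and not text:
        raise ValueError(f"task {task!r} requires a non-empty text argument")
    return template.format(lang=lang, text=text)


print(build_prompt("caption"))
print(build_prompt("vqa", "What color is the car?"))
print(build_prompt("detect", "cat"))
```

The returned string can be used in place of the empty prompt in step 3. Keeping the templates in one dictionary makes it straightforward to correct a prefix if the model card specifies a different form.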

License

PaliGemma 2 is released under the Gemma license. To access the model on Hugging Face, you must first review and accept Google's usage license.
