paligemma2 28b pt 896

google

Introduction

PaliGemma 2 is a vision-language model (VLM) that combines the PaliGemma design with the Gemma 2 language models. It is designed to be fine-tuned on vision-language tasks such as image captioning, visual question answering, object detection, and segmentation. It accepts multilingual input and produces multilingual output, so it can be adapted to many domains.
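Because the model emits text only, detection results have to be parsed out of the generated string. The sketch below assumes the location-token convention used by the PaliGemma family: four `<locNNNN>` tokens per object, in the order y_min, x_min, y_max, x_max on a 0 to 1023 grid, followed by a label, with objects separated by " ; ". The function name `parse_detections` and the sample string are hypothetical, and a fine-tuned checkpoint may use a different output format.

```python
import re

# Assumption: one detection is four location tokens followed by a label, e.g.
# "<loc0256><loc0128><loc0768><loc0896> car", with objects separated by " ; ".
_BOX = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text: str, width: int, height: int) -> list[dict]:
    """Convert location tokens into pixel boxes (x_min, y_min, x_max, y_max)."""
    results = []
    for locs, label in _BOX.findall(text):
        # Tokens are assumed to be ordered y_min, x_min, y_max, x_max on a 1024-step grid.
        y0, x0, y1, x1 = (int(v) / 1024 for v in re.findall(r"<loc(\d{4})>", locs))
        results.append({
            "label": label.strip(),
            "box_xyxy": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return results


sample = "<loc0256><loc0128><loc0768><loc0896> car"  # hypothetical model output
detections = parse_detections(sample, width=1024, height=1024)
print(detections)
```

Returning pixel coordinates in x-first order keeps the result compatible with most drawing utilities; keep the normalized values instead if the image will be resized later.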

Architecture

PaliGemma 2 features a Transformer decoder and a Vision Transformer image encoder. The text decoder initializes from Gemma 2 at various parameter sizes (2B, 9B, and 27B), while the image encoder initializes from SigLIP-So400m/14. The model processes both image and text inputs to generate text outputs, supporting tasks like caption generation and object detection.
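The image encoder's patch size determines how many image tokens reach the decoder. A minimal sketch of that arithmetic follows, assuming a patch size of 14 (the "/14" in SigLIP-So400m/14), a square input at the 896-pixel resolution in this checkpoint's name, and no pooling of patch tokens. The 224 and 448 entries are included for comparison only, and `num_image_tokens` is a hypothetical helper.

```python
# Sketch: image tokens produced by a ViT-style encoder for a square input.
# Assumptions: patch size 14, one token per patch, no token pooling.

def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Return the number of non-overlapping patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


for res in (224, 448, 896):
    print(res, num_image_tokens(res))
```

Under these assumptions an 896-pixel input yields 64 x 64 = 4096 image tokens, sixteen times as many as a 224-pixel input, which is why the high-resolution checkpoints cost noticeably more to run.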

Training

PaliGemma 2 is pre-trained on a mixture of datasets, including WebLI, CC3M-35L, VQ²A-CC3M-35L, OpenImages, and WIT. The training data is filtered to remove pornographic, toxic, or otherwise sensitive content. Training runs on TPU hardware with the JAX, Flax, and TFDS software stack.

Guide: Running Locally

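  1. Setup Environment: Install the transformers library with pip install transformers. The snippet in step 2 also needs torch, pillow (used by load_image), and accelerate (required for device_map="auto").
  2. Load Model: Use the following snippet to load the model and generate text for a sample image.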
    from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
    from transformers.image_utils import load_image
    import torch
    
    model_id = "google/paligemma2-28b-pt-896"
    # Load the weights in bfloat16 and let device_map="auto" place them on the available devices.
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
    processor = PaliGemmaProcessor.from_pretrained(model_id)
    
    # Download a sample image.
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
    image = load_image(url)
    
    # The prompt is left empty here, so no task text accompanies the image.
    prompt = ""
    # Cast floating-point inputs to bfloat16 to match the model, then move them to its device.
    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
    
    with torch.inference_mode():
        # Greedy decoding, capped at 100 new tokens.
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        # Skip the input tokens (image and prompt) so only the newly generated text is decoded.
        decoded = processor.decode(generation[0][model_inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
        print(decoded)
    
  3. Fine-Tuning: This is a pre-trained (pt) checkpoint meant as a starting point, so fine-tune it on your target task to get task-specific performance.
  4. Cloud GPUs: For efficient computation, especially with large models, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure.
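
To judge what hardware steps 2 and 4 call for, it helps to estimate the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes roughly 28 billion parameters (read off the "28b" in the checkpoint name) and counts neither activations nor the generation cache. `weight_memory_gib` is a hypothetical helper.

```python
# Rough estimate of the memory required to store model weights.
# Assumptions: ~28e9 parameters (from the "28b" in the checkpoint name);
# activations and the key/value cache used during generation are ignored.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the weight footprint in GiB for the given parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(28e9, dtype):.1f} GiB")
```

In bfloat16 this comes to a little over 52 GiB for the weights alone, so a single mid-range GPU is unlikely to hold the model. That is one reason the snippet in step 2 passes device_map="auto", which can spread the weights across several devices.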

License

The PaliGemma 2 model is distributed under the "gemma" license. Users must review and agree to Google's usage terms before accessing the model.
