ColPali v1.2


Introduction

ColPali is a model designed for efficient document retrieval using Vision Language Models (VLMs). It extends the PaliGemma-3B architecture to generate ColBERT-style multi-vector representations of text and images. This approach improves document indexing by leveraging both visual and textual content.

Architecture

The model builds upon a SigLIP backbone, which is first fine-tuned to create BiSigLIP. The SigLIP image patch embeddings are then fed to the PaliGemma-3B language model, resulting in BiPali. This places image patches and text tokens in a shared latent space, improving performance by enabling interaction between text tokens and image patches at retrieval time.
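
The multi-vector comparison works via ColBERT-style late interaction (MaxSim): each query token embedding is compared against all image patch embeddings, the per-token maxima are taken, and these are summed into a single relevance score. A minimal sketch of that scoring rule (the names maxsim_score, query_emb, and doc_emb are illustrative, not part of the colpali-engine API):

    import torch

    def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
        # query_emb: (num_query_tokens, dim) multi-vector query embedding
        # doc_emb: (num_patches, dim) multi-vector document embedding
        sim = query_emb @ doc_emb.T            # (num_query_tokens, num_patches)
        return sim.max(dim=1).values.sum()     # best patch per token, summed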

Training

Dataset

ColPali's training set contains 127,460 query-page pairs: 63% come from academic datasets and 37% are synthetic pairs built from web-crawled PDFs. The data is intentionally English-centric so that zero-shot generalization to other languages can be evaluated. A validation set (2% of the samples) is held out for hyperparameter tuning.

Parameters

Models are trained for one epoch in bfloat16 format. Low-rank adapters (LoRA) are applied to the transformer layers and the projection layer, with training run on an 8-GPU setup. The learning rate is 5e-5 with linear decay, and the batch size is 32.
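
As a rough sketch, such an adapter setup can be expressed with the peft library; the rank and alpha values (r = 32, alpha = 32) are those reported in the ColPali paper, while the target_modules list below is an illustrative assumption, not the exact configuration of this checkpoint:

    from peft import LoraConfig

    # Illustrative LoRA configuration; the targeted projection layers are assumed.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )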

Guide: Running Locally

  1. Install Dependencies
    Install colpali-engine (version 0.3.x), quoting the version specifier so the shell does not interpret the angle brackets:

    pip install "colpali-engine>=0.3.0,<0.4.0"
    
  2. Load the Model
    Use the following Python code to load and use the model:

    import torch
    from PIL import Image
    from colpali_engine.models import ColPali, ColPaliProcessor
    
    model_name = "vidore/colpali-v1.2"
    
    # Load the model in bfloat16 on the first CUDA device.
    model = ColPali.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    ).eval()
    processor = ColPaliProcessor.from_pretrained(model_name)
    
    # Placeholder inputs; replace with your document page images and queries.
    images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
    queries = ["Is attention really all you need?", "Are Benjamin, Antoine, Merve, and Jo best friends?"]
    
    # Preprocess the inputs and move them to the model's device.
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)
    
    # Forward passes yield multi-vector embeddings: one vector per image patch
    # or query token.
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)
    
    # Late-interaction (MaxSim) scores between every query and every image.
    scores = processor.score_multi_vector(query_embeddings, image_embeddings)
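
    The returned scores form a tensor of shape (num_queries, num_images), where higher values indicate a better match. A short usage sketch (assuming, as in colpali-engine 0.3.x, that score_multi_vector returns a torch tensor):

    # Index of the best-matching image for each query.
    best_match = scores.argmax(dim=1)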
    
  3. Cloud GPUs
    For optimal performance, consider using cloud services like AWS or Google Cloud for GPU resources.

License

ColPali's backbone model, PaliGemma, is licensed under the Gemma license, while the adapters are available under the MIT license.
