ColPali v1.3

vidore/colpali-v1.3

Introduction

ColPali is a visual retriever model built on the PaliGemma-3B architecture with a ColBERT-style late-interaction strategy. It generates multi-vector representations of text and images for efficient document retrieval, as detailed in the paper "ColPali: Efficient Document Retrieval with Vision Language Models."

Architecture

ColPali uses a Vision Language Model (VLM) approach to index documents through their visual features. The model builds on the SigLIP vision encoder, which was first fine-tuned to create BiSigLIP; feeding SigLIP's image patch embeddings into the PaliGemma-3B language model yields BiPali. On top of BiPali, ColPali adds a ColBERT-style late interaction between query text tokens and document image patches.
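
As a concrete illustration, ColBERT-style late interaction scores a query against a page by taking, for each query token embedding, its maximum similarity over all image patch embeddings and summing those maxima (MaxSim). A minimal sketch of this scoring, assuming L2-normalized embeddings (the function name and shapes are illustrative, not from the ColPali codebase):

    import torch

    def late_interaction_score(query_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # query_emb: (n_query_tokens, dim); image_emb: (n_patches, dim), both L2-normalized.
        sim = query_emb @ image_emb.T        # (n_query_tokens, n_patches) similarity matrix
        return sim.max(dim=1).values.sum()   # best patch per query token, summed over tokens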

Training

The model is trained on a dataset of 127,460 query-page pairs, combining academic datasets with synthetic data built from web-crawled PDFs. Training is conducted in bfloat16 with low-rank adapters (LoRA), a learning rate of 5e-5 with linear decay, and a batch size of 32 across 8 GPUs. The model was trained for one epoch, using in-batch negatives and hard-mined negatives.
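
For orientation, a low-rank adapter setup of this kind can be expressed with the peft library; the rank, alpha, dropout, and target modules below are assumptions for illustration, not values taken from this card:

    from peft import LoraConfig

    # Hypothetical LoRA setup; r, lora_alpha, lora_dropout, and target_modules are assumptions.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )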

Guide: Running Locally

To run ColPali locally, follow these steps:

  1. Install the ColPali Engine:

    pip install "colpali-engine>=0.3.0,<0.4.0"
    
  2. Load and Run the Model:

    import torch
    from PIL import Image
    from colpali_engine.models import ColPali, ColPaliProcessor
    
    model_name = "vidore/colpali-v1.3"
    
    # Load the model in bfloat16 on the first GPU.
    model = ColPali.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    ).eval()
    
    processor = ColPaliProcessor.from_pretrained(model_name)
    
    # Placeholder inputs; replace with your document images and queries.
    images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
    queries = ["Is attention really all you need?", "Are Benjamin, Antoine, Merve, and Jo best friends?"]
    
    # Preprocess the inputs and move them to the model's device.
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)
    
    # Forward passes produce multi-vector embeddings.
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)
    
    # Late-interaction (MaxSim) scores between every query and every image.
    scores = processor.score_multi_vector(query_embeddings, image_embeddings)
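    
    # `scores` is a (len(queries), len(images)) tensor of late-interaction
    # similarities; argmax over images gives each query's best match.
    # (Added for illustration; assumes score_multi_vector returns a torch tensor.)
    best_image_per_query = scores.argmax(dim=1)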
    
  3. Cloud GPUs: For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.

License

The ColPali model's vision-language backbone, PaliGemma, is distributed under the Gemma license, while the trained adapters are released under the MIT license.
