ColPali v1.3 (vidore)
Introduction
ColPali is a visual retriever model based on the PaliGemma-3B architecture with a ColBERT-style late-interaction strategy. It is designed for efficient document retrieval, generating multi-vector representations of both text queries and page images. The approach is detailed in the paper "ColPali: Efficient Document Retrieval with Vision Language Models."
Architecture
ColPali utilizes a Vision Language Model (VLM) approach to index documents through their visual features. The model is built upon the SigLIP architecture, which was fine-tuned to create BiSigLIP. The image patch embeddings from SigLIP are processed by the PaliGemma-3B language model to produce BiPali, facilitating interactions between text tokens and image patches using the ColBERT strategy.
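The ColBERT-style interaction between text tokens and image patches can be sketched as a "MaxSim" score: each query-token embedding is matched against its most similar document-patch embedding, and the per-token maxima are summed. This is a minimal illustration of the scoring idea, not the library's implementation:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim score.

    query_emb: (num_query_tokens, dim)
    doc_emb:   (num_patches, dim)
    Returns a scalar: sum over query tokens of the max similarity
    against all document patches.
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()   # max over patches, sum over tokens

# Toy example: two one-hot query tokens, three one-hot patches in 4-dim space
q = torch.eye(4)[:2]   # query tokens e0, e1
d = torch.eye(4)[1:]   # patches e1, e2, e3
score = late_interaction_score(q, d)     # only e1 matches, so score == 1.0
```

In ColPali this scoring runs over the multi-vector outputs produced per query and per page image.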
Training
The model is trained on a dataset of 127,460 query-page pairs, combining academic datasets with synthetic data generated from web-crawled PDFs. Training uses the bfloat16 format with low-rank adapters (LoRA), a learning rate of 5e-5 with linear decay, and a batch size of 32 on an 8-GPU setup. The model was trained for 1 epoch, employing techniques such as in-batch negatives and hard-mined negatives.
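The in-batch-negatives technique mentioned above can be sketched as a cross-entropy objective over a batch-level score matrix, where each query's positive page sits on the diagonal and the other pages in the batch serve as negatives. A minimal sketch under that assumption (not the exact training code):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(scores: torch.Tensor) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    scores: (batch, batch) matrix where scores[i, j] is query i scored
    against page j; scores[i, i] is the positive pair for query i.
    """
    labels = torch.arange(scores.size(0))     # positive index = diagonal
    return F.cross_entropy(scores, labels)

# Toy batch of 4 query-page pairs with strong diagonal (positive) scores,
# so the loss should be close to zero
scores = torch.full((4, 4), 0.1) + torch.eye(4) * 5.0
loss = in_batch_negative_loss(scores)
```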
Guide: Running Locally
To run ColPali locally, follow these steps:
- Install the ColPali Engine (quote the version range so the shell does not treat `<` and `>` as redirections):

```shell
pip install "colpali-engine>=0.3.0,<0.4.0"
```
- Load and Run the Model:

```python
import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.3"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # use "cpu" if no GPU is available
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Example inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "Are Benjamin, Antoine, Merve, and Jo best friends?",
]

# Preprocess the inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass to obtain multi-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction score for every (query, image) pair
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
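Assuming `score_multi_vector` returns a `(num_queries, num_pages)` score matrix, ranking retrieved pages for each query reduces to sorting along the page axis. A sketch with made-up scores standing in for the real output:

```python
import torch

# Hypothetical scores for 2 queries against 3 pages; in practice this
# tensor comes from processor.score_multi_vector(...)
scores = torch.tensor([[12.5, 3.1, 7.8],
                       [4.0, 15.2, 6.6]])

# Best-matching page index per query
best_pages = scores.argmax(dim=1)

# Full ranking per query, best page first
ranking = scores.argsort(dim=1, descending=True)
```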
- Cloud GPUs: For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
The ColPali model’s vision language backbone, PaliGemma, is under the Gemma license, while the adapters are licensed under the MIT license.