ColPali v1.2


Introduction

ColPali is a model designed for efficient document retrieval using Vision Language Models (VLMs). It extends the PaliGemma-3B architecture to generate ColBERT-style multi-vector representations of text and images. This approach improves document indexing by leveraging both visual and textual content.

Architecture

The model builds upon a SigLIP backbone, which is first fine-tuned to create BiSigLIP. The SigLIP image patch embeddings are then fed to the PaliGemma-3B language model, resulting in BiPali. This places image patches and text tokens in a shared latent space, improving performance by enabling interaction between text tokens and image patches at retrieval time.
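
The multi-vector comparison works via ColBERT-style late interaction (MaxSim): each query token embedding is compared against all image patch embeddings, the per-token maxima are taken, and these are summed into a single relevance score. A minimal sketch of that scoring rule (the names maxsim_score, query_emb, and doc_emb are illustrative, not part of the colpali-engine API):

    import torch

    def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
        # query_emb: (num_query_tokens, dim) multi-vector query embedding
        # doc_emb: (num_patches, dim) multi-vector document embedding
        sim = query_emb @ doc_emb.T            # (num_query_tokens, num_patches)
        return sim.max(dim=1).values.sum()     # best patch per token, summed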

Training

Dataset

ColPali's training set contains 127,460 query-page pairs: 63% come from academic datasets and 37% are synthetic pairs built from web-crawled PDFs. The data is intentionally English-centric so that zero-shot generalization to other languages can be evaluated. A validation set (2% of the samples) is held out for hyperparameter tuning.

Parameters

Models are trained for one epoch in bfloat16 format. Low-rank adapters (LoRA) are applied to the transformer layers and the projection layer, with training run on an 8-GPU setup. The learning rate is 5e-5 with linear decay, and the batch size is 32.
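
As a rough sketch, such an adapter setup can be expressed with the peft library; the rank and alpha values (r = 32, alpha = 32) are those reported in the ColPali paper, while the target_modules list below is an illustrative assumption, not the exact configuration of this checkpoint:

    from peft import LoraConfig

    # Illustrative LoRA configuration; the targeted projection layers are assumed.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )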

Guide: Running Locally

  1. Install Dependencies
    Install colpali-engine (version 0.3.x), quoting the version specifier so the shell does not interpret the angle brackets:

    pip install "colpali-engine>=0.3.0,<0.4.0"
    
  2. Load the Model
    Use the following Python code to load and use the model:

    import torch
    from PIL import Image
    from colpali_engine.models import ColPali, ColPaliProcessor
    
    model_name = "vidore/colpali-v1.2"
    
    # Load the model in bfloat16 on the first CUDA device.
    model = ColPali.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    ).eval()
    processor = ColPaliProcessor.from_pretrained(model_name)
    
    # Placeholder inputs; replace with your document page images and queries.
    images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
    queries = ["Is attention really all you need?", "Are Benjamin, Antoine, Merve, and Jo best friends?"]
    
    # Preprocess the inputs and move them to the model's device.
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)
    
    # Forward passes yield multi-vector embeddings: one vector per image patch
    # or query token.
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)
    
    # Late-interaction (MaxSim) scores between every query and every image.
    scores = processor.score_multi_vector(query_embeddings, image_embeddings)
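
    The returned scores form a tensor of shape (num_queries, num_images), where higher values indicate a better match. A short usage sketch (assuming, as in colpali-engine 0.3.x, that score_multi_vector returns a torch tensor):

    # Index of the best-matching image for each query.
    best_match = scores.argmax(dim=1)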
    
  3. Cloud GPUs
    For optimal performance, consider using cloud services like AWS or Google Cloud for GPU resources.

License

ColPali's backbone model, PaliGemma, is licensed under the Gemma license, while the adapters are available under the MIT license.
