ColPali v1.2 (vidore)

Introduction
ColPali is a model designed for efficient document retrieval using Vision Language Models (VLMs). It extends the PaliGemma-3B architecture to generate ColBERT-style multi-vector representations of text and page images. Because pages are embedded directly from their visual rendering, indexing can exploit both visual cues (layout, figures, tables) and textual content.
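In the ColBERT-style late-interaction scheme, a query is scored against a page by matching each query token embedding with its most similar page patch embedding and summing those maxima. Below is a minimal sketch of that MaxSim scoring; the function name and toy shapes are illustrative, and the library's own implementation appears later as `processor.score_multi_vector`:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)
    # For each query token, keep its best match among the page's patch
    # embeddings, then sum over query tokens (ColBERT-style MaxSim).
    sim = query_emb @ page_emb.T  # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()

# Toy example with random vectors; ColPali's embeddings are 128-dimensional
score = maxsim_score(torch.randn(12, 128), torch.randn(1024, 128))
```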
Architecture
The model builds on a SigLIP backbone, which is first fine-tuned to create BiSigLIP. Feeding SigLIP's image patch embeddings through the PaliGemma-3B language model yields BiPali, which places image patches and text tokens in a shared latent space and improves performance by enabling interaction between text tokens and image patches.
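To make the shared-latent-space idea concrete, here is a hypothetical sketch of the final projection step: each hidden state coming out of the language model is mapped to a small, normalized embedding, producing one vector per text token or image patch. The dimensions are assumptions (PaliGemma-3B hidden size 2048; ColPali's published embedding dimension is 128):

```python
import torch
import torch.nn as nn

class MultiVectorHead(nn.Module):
    """Sketch of a ColPali-style projection head (dimensions assumed)."""

    def __init__(self, hidden_size: int = 2048, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_size, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the VLM decoder
        emb = self.proj(hidden_states)               # (batch, seq_len, dim)
        return emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize each vector
```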
Training
Dataset
ColPali's training dataset includes 127,460 query-page pairs, consisting of 63% academic datasets and 37% synthetic data from web-crawled PDFs. This dataset is intentionally English-centric to test zero-shot generalization to other languages. A validation set (2% of samples) is used for hyperparameter tuning.
Parameters
Models are trained for one epoch in bfloat16 format. Low-rank adapters (LoRA) are applied to the language model's transformer layers and the final projection layer, with training performed on an 8-GPU setup. The learning rate is 5e-5 with linear decay, and the batch size is 32.
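For illustration, a comparable adapter configuration could be expressed with the `peft` library as below. The rank and scaling values (r = 32, alpha = 32) follow the ColPali paper; the target module names are assumptions for a PaliGemma-style language model, not the exact training script:

```python
from peft import LoraConfig

# Hypothetical LoRA setup mirroring ColPali's reported hyperparameters
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module names
)
```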
Guide: Running Locally
- Install Dependencies

  Ensure `colpali-engine` is installed:

  ```bash
  pip install "colpali-engine>=0.3.0,<0.4.0"
  ```

- Load the Model

  Use the following Python code to load and run the model:

  ```python
  import torch
  from PIL import Image

  from colpali_engine.models import ColPali, ColPaliProcessor

  model_name = "vidore/colpali-v1.2"

  # Load the model in bfloat16 on the first GPU
  model = ColPali.from_pretrained(
      model_name,
      torch_dtype=torch.bfloat16,
      device_map="cuda:0",
  ).eval()
  processor = ColPaliProcessor.from_pretrained(model_name)

  # Dummy inputs for illustration
  images = [
      Image.new("RGB", (32, 32), color="white"),
      Image.new("RGB", (16, 16), color="black"),
  ]
  queries = [
      "Is attention really all you need?",
      "Are Benjamin, Antoine, Merve, and Jo best friends?",
  ]

  # Preprocess images and queries
  batch_images = processor.process_images(images).to(model.device)
  batch_queries = processor.process_queries(queries).to(model.device)

  # Forward passes to obtain multi-vector embeddings
  with torch.no_grad():
      image_embeddings = model(**batch_images)
      query_embeddings = model(**batch_queries)

  # Late-interaction (MaxSim) scores for every query-image pair
  scores = processor.score_multi_vector(query_embeddings, image_embeddings)
  ```
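  `score_multi_vector` computes the ColBERT-style MaxSim score for every query-image pair, so `scores` is a tensor of shape `(len(queries), len(images))`; `scores.argmax(dim=1)` then gives the index of the best-matching image for each query.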
- Cloud GPUs

  For optimal performance, consider using cloud services such as AWS or Google Cloud for GPU resources.
License
ColPali's backbone model, PaliGemma, is licensed under the Gemma license, while the adapters are available under the MIT license.