colpali
Introduction
ColPali is a visual retriever model that efficiently indexes documents using visual features, built on the PaliGemma-3B architecture with a ColBERT-style late-interaction strategy. It is designed for document retrieval with vision language models (VLMs) and produces multi-vector representations of both text and images.
Architecture
The model builds upon an initial SigLIP model, fine-tuned into BiSigLIP, and further developed by integrating it with PaliGemma-3B to form BiPali. ColPali then adds a ColBERT-style late-interaction step that computes fine-grained similarities between query text tokens and document image patches, which substantially improves retrieval performance.
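To make the interaction step concrete, below is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring in PyTorch; the tensor shapes and the assumption of L2-normalized embeddings are illustrative, not taken from the ColPali implementation.

```python
import torch

def late_interaction_score(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> float:
    """ColBERT-style MaxSim: for each query token, take the similarity of its
    best-matching document patch, then sum over all query tokens.

    query_embeds: (num_query_tokens, dim), assumed L2-normalized
    doc_embeds:   (num_image_patches, dim), assumed L2-normalized
    """
    sim = query_embeds @ doc_embeds.T          # (num_query_tokens, num_image_patches)
    return sim.max(dim=1).values.sum().item()  # best patch per query token, summed
```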
Training
Dataset
The training dataset comprises 127,460 query-page pairs: 63% from academic datasets and 37% from a synthetic dataset built from web-crawled PDFs. The dataset is entirely in English, which allows zero-shot generalization to other languages to be assessed. A validation set of 2% of the samples is held out for hyperparameter tuning.
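For reference, the approximate split sizes implied by these percentages are computed below; the rounded counts are derived here from the stated figures, not reported numbers.

```python
total = 127_460
academic = round(total * 0.63)    # ~80,300 pairs from academic datasets
synthetic = total - academic      # ~47,160 pairs from synthetic web-crawled PDFs
validation = round(total * 0.02)  # ~2,549 pairs held out for validation
```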
Parameters
The model is trained for one epoch in bfloat16 with low-rank adapters (LoRA) applied to the transformer layers of the language model, using the paged_adamw_8bit optimizer. Training runs on an 8-GPU setup with data parallelism, a learning rate of 5e-5, and a batch size of 32.
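As a hedged illustration, a comparable LoRA setup could be expressed with the peft library as below; the rank, scaling factor, dropout, and target modules are assumptions for the sketch, not the exact values used to train ColPali.

```python
from peft import LoraConfig

# Illustrative LoRA configuration (values are assumptions, not ColPali's exact settings)
lora_config = LoraConfig(
    r=32,                    # assumed adapter rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.1,        # assumed dropout on adapter layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # language-model attention projections
    task_type="FEATURE_EXTRACTION",
)
```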
Guide: Running Locally
- Install the required package:
  `pip install colpali_engine==0.1.1`
- Set up the environment: ensure your system has a CUDA-enabled GPU for optimal performance.
- Run the example script: use the provided Python script to load the model, process images, and run inference (a minimal sketch follows this list). Install the necessary Python packages such as `torch`, `typer`, and `transformers`.
- Cloud GPUs: consider cloud services such as AWS EC2, Google Cloud, or Azure for access to powerful GPUs if local resources are insufficient.
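The sketch below shows how loading the model and scoring queries against page images might look. It follows the `ColPali`/`ColPaliProcessor` classes exposed by newer releases of `colpali_engine`; version 0.1.1 organizes these utilities differently, so check the imports, and the checkpoint name (`vidore/colpali` here) against the version you install.

```python
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

# Checkpoint name and API follow newer colpali_engine releases; adjust for your version.
model_name = "vidore/colpali"
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="cuda:0").eval()
processor = ColPaliProcessor.from_pretrained(model_name)

images = [Image.open("page_1.png"), Image.open("page_2.png")]  # document page screenshots
queries = ["What is the training batch size?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# Late-interaction (MaxSim) scores between each query and each page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```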
License
ColPali's backbone model (PaliGemma) follows the Gemma license, while the adapters are licensed under the MIT license.