vidore/colpali-v1.3-hf
Introduction
ColPali is a model designed for efficient document retrieval using Vision Language Models (VLMs). It extends the PaliGemma-3B model to generate ColBERT-style multi-vector representations of text and images. The model was introduced in the paper "ColPali: Efficient Document Retrieval with Vision Language Models" and is implemented in the Hugging Face transformers library.
Architecture
ColPali builds upon the SigLIP model, which is first fine-tuned into BiSigLIP, and extends it by feeding patch embeddings to the PaliGemma-3B language model, creating BiPali. This architecture maps image patch embeddings into a latent space aligned with textual inputs and applies the ColBERT late-interaction strategy for fine-grained matching between text tokens and image patches.
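The ColBERT-style scoring can be made concrete with a small sketch: each query token embedding is compared against every image patch embedding, the maximum similarity per query token is kept, and the maxima are summed (the "MaxSim" operation). The function name, embedding dimension, and tensor shapes below are illustrative, not prescribed by the model card.

```python
import torch

def maxsim_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score.

    query_embeddings: (num_query_tokens, dim)
    doc_embeddings:   (num_patches, dim)
    """
    # Pairwise similarities: (num_query_tokens, num_patches)
    sim = query_embeddings @ doc_embeddings.T
    # Best-matching patch for each query token, summed over the query
    return sim.max(dim=1).values.sum()

# Toy example with random embeddings
query = torch.randn(16, 128)   # 16 query tokens, dim 128
page = torch.randn(1024, 128)  # 1024 image patch embeddings
print(maxsim_score(query, page))
```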
Training
The training dataset consists of 127,460 query-page pairs, combining academic datasets with synthetic datasets built from web-crawled PDF documents. The model is trained for one epoch in bfloat16 format, using low-rank adapters (LoRA) and a paged_adamw_8bit optimizer on an 8-GPU setup. Training settings include a learning rate of 5e-5 with linear decay, 2.5% warmup steps, and a batch size of 32.
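As a rough illustration, these hyperparameters map onto standard transformers and peft configuration objects as sketched below; the LoRA rank and alpha, the target modules, and the per-device batch split are assumptions not stated in this card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Low-rank adapter settings; r, lora_alpha, and target_modules are
# placeholder values, not taken from the ColPali training recipe.
lora_config = LoraConfig(r=32, lora_alpha=32, target_modules="all-linear")

training_args = TrainingArguments(
    output_dir="./colpali-train",
    num_train_epochs=1,              # one epoch over the 127,460 pairs
    bf16=True,                       # bfloat16 format
    optim="paged_adamw_8bit",        # paged 8-bit AdamW optimizer
    learning_rate=5e-5,
    lr_scheduler_type="linear",      # linear decay
    warmup_ratio=0.025,              # 2.5% warmup steps
    per_device_train_batch_size=4,   # assumed split: 4 x 8 GPUs = 32 global
)
```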
Guide: Running Locally
- Install Dependencies: Ensure you have Python and PyTorch installed, then use pip to install the transformers library.
- Load Model:
```python
from transformers import ColPaliForRetrieval, ColPaliProcessor

model = ColPaliForRetrieval.from_pretrained("vidore/colpali-v1.3-hf")
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3-hf")
```
- Prepare Data: Use the PIL library to handle image inputs and create queries as text strings.
- Inference: Process both the images and the queries, run the forward pass to obtain multi-vector embeddings, and score the queries against the images (see the sketch after this list).
- Cloud GPUs: For enhanced performance, consider using cloud services like AWS EC2 or Google Cloud Platform that offer GPU instances.
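Putting the Load Model, Prepare Data, and Inference steps together, the sketch below follows the ColPali API in transformers; the blank test image and the query string are placeholders to swap for real document pages and questions.

```python
import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.3-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # place the model on an available GPU if present
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Placeholder inputs: replace with real document page images and queries.
images = [Image.new("RGB", (448, 448), "white")]
queries = ["What batch size was used to train ColPali?"]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass: multi-vector embeddings for pages and queries
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Late-interaction scores, one per (query, image) pair
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print(scores)
```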
License
ColPali's vision-language backbone, PaliGemma, is released under the Gemma license. The model's adapters are released under the MIT license.