ColQwen2 v0.1
Introduction
ColQwen2 is a visual retriever model designed to index documents using visual features. It extends the Qwen2-VL-2B model with a ColBERT-style multi-vector representation approach and was introduced in the paper "ColPali: Efficient Document Retrieval with Vision Language Models."
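For intuition, the ColBERT-style approach scores a query against a page by late interaction: every query token embedding is compared with every page patch embedding, and the per-token maxima are summed (MaxSim). Below is a minimal plain-PyTorch sketch of that operation; it is illustrative only and not the colpali-engine implementation, which the library exposes as `processor.score_multi_vector` (shown in the usage guide further down).

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score for one query/page pair.

    query_emb: (num_query_tokens, dim) multi-vector query embedding
    page_emb:  (num_patches, dim) multi-vector page embedding
    Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()    # best patch per query token, summed
```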
Architecture
The model uses dynamic image resolutions: input images are not resized, so their aspect ratios are preserved. It can process up to 768 image patches, trading higher memory usage for improved performance. The model is built with the ColPali engine, which adapts the transformer backbone with low-rank adapters (LoRA) and trains it using a paged AdamW optimizer.
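As a rough illustration of what a 768-patch budget implies for input resolution, the sketch below assumes Qwen2-VL's published patching scheme (14-pixel patches merged 2x2, i.e. one visual token per 28x28 pixel area). The constants and helper names are hypothetical, not part of colpali-engine.

```python
PATCH_EDGE = 28          # pixels per visual token edge after 2x2 merging (Qwen2-VL design)
MAX_VISUAL_TOKENS = 768  # patch budget mentioned above

def visual_token_count(width: int, height: int) -> int:
    """Approximate number of visual tokens for an image at this resolution."""
    return (width // PATCH_EDGE) * (height // PATCH_EDGE)

def fit_to_budget(width: int, height: int) -> tuple[int, int]:
    """Downscale (preserving aspect ratio) until the token budget holds."""
    tokens = visual_token_count(width, height)
    if tokens <= MAX_VISUAL_TOKENS:
        return width, height
    scale = (MAX_VISUAL_TOKENS / tokens) ** 0.5
    # Flooring effects can leave the estimate slightly over budget,
    # so nudge the scale down until the constraint holds.
    while visual_token_count(int(width * scale), int(height * scale)) > MAX_VISUAL_TOKENS:
        scale *= 0.99
    return int(width * scale), int(height * scale)

print(visual_token_count(1092, 1568))  # a tall A4-like page: 2184 tokens, over budget
print(fit_to_budget(1092, 1568))       # downscaled to fit within 768 tokens
```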
Training
Dataset
The training dataset comprises 127,460 query-page pairs: 63% from academic datasets and 37% synthetic data generated from web-crawled PDF documents. The data is predominantly English. A 2% validation split is held out for hyperparameter tuning, and the training set contains no contamination from the evaluation datasets.
Parameters
- Epochs: 1
- Precision: bfloat16
- Adapters: LoRA, alpha=32, r=32
- Optimizer: Paged AdamW 8-bit
- Setup: 8 GPUs, data parallelism
- Learning Rate: 5e-5 with linear decay and 2.5% warmup
- Batch Size: 32
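For concreteness, these hyperparameters map onto the `peft` and `transformers` APIs roughly as in the sketch below. The target modules and the per-device batch size (4 per GPU across 8 GPUs, giving the global batch of 32) are illustrative assumptions, not stated in the card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the list above; target_modules is an illustrative
# guess (attention projections), not confirmed by the model card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Optimizer and schedule settings from the list above. With 8 GPUs under
# data parallelism, a per-device batch of 4 yields the global batch of 32.
training_args = TrainingArguments(
    output_dir="colqwen2-finetune",  # hypothetical path
    num_train_epochs=1,
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,
    per_device_train_batch_size=4,
)
```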
Guide: Running Locally
Ensure `colpali-engine` version > 0.3.1 and `transformers` version > 4.45.0 are installed.
- Installation:

```bash
pip install git+https://github.com/illuin-tech/colpali
```
- Usage:

```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Inputs: dummy images and example queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Preprocess the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: one multi-vector embedding per image and per query
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores between every query and every image
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
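`score_multi_vector` returns one late-interaction score per (query, image) pair, so `scores` here is a 2 x 2 tensor. The best-matching page for each query can then be read off with an argmax:

```python
best_pages = scores.argmax(dim=1)  # index of the highest-scoring image for each query
```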
Cloud GPUs: If a suitable local GPU is unavailable, consider cloud GPU services such as AWS, Google Cloud, or Azure for training and inference.
License
The vision language backbone model (Qwen2-VL) is licensed under Apache 2.0, while the adapters attached to the model are licensed under MIT.