ColQwen2 v1.0 (vidore/colqwen2-v1.0)
Introduction
ColQwen2 is a model for efficient document retrieval based on Vision Language Models (VLMs). It extends the Qwen2-VL-2B architecture with a ColBERT-style approach that generates multi-vector representations of text and images, allowing documents to be indexed directly from their visual features.
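To make the ColBERT-style scoring concrete, here is a minimal sketch of late interaction (MaxSim) between one query and one document, assuming both are already embedded as matrices of token/patch vectors; `maxsim_score` is an illustrative helper, not part of colpali-engine:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_patches, dim).
    # For each query token, keep its best similarity over all document
    # patches, then sum the per-token maxima into one relevance score.
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()
```

In practice, colpali-engine's `score_multi_vector` (used in the guide below) applies this kind of scoring over whole batches of queries and documents.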
Architecture
ColQwen2 operates on dynamic image resolutions, preserving each image's aspect ratio rather than resizing it to a fixed size. The architecture supports up to 768 image patches per image, trading image detail against memory use. Retrieval follows the ColBERT late-interaction strategy over these multi-vector representations, which improves performance on PDF-type documents.
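As a rough illustration of that patch budget, the hypothetical helper below downscales an image, preserving its aspect ratio, so that its patch grid stays within the limit. The 28-pixel effective patch size is an assumption based on Qwen2-VL's vision encoder, and `fit_to_patch_budget` is not a real colpali-engine function:

```python
import math

def fit_to_patch_budget(width: int, height: int,
                        patch_size: int = 28, max_patches: int = 768):
    # Number of patch columns/rows the image currently occupies.
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    if cols * rows <= max_patches:
        return width, height                 # already within the budget
    # Shrink both sides by the same factor, rounding the grid down so
    # the budget is never exceeded.
    scale = math.sqrt(max_patches / (cols * rows))
    new_cols = max(1, math.floor(cols * scale))
    new_rows = max(1, math.floor(rows * scale))
    return new_cols * patch_size, new_rows * patch_size
```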
Training
The model is trained on 127,460 query-page pairs: 63% from academic datasets and 37% from a synthetic set of web-crawled PDF pages paired with VLM-generated pseudo-questions. The training data is entirely in English, which makes it possible to study zero-shot generalization to other languages. Training runs for one epoch in bfloat16 with LoRA adapters on an 8-GPU setup using data parallelism, a learning rate of 5e-5, and a batch size of 32.
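Those hyperparameters map naturally onto a transformers/peft setup. The sketch below shows one plausible configuration: the epoch count, dtype, learning rate, and batch size come from this section, while the LoRA rank, alpha, target modules, and output path are assumptions, not values stated here:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings; rank, alpha, and targets are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)

# Values stated above: one epoch, bfloat16, lr 5e-5, global batch size 32.
training_args = TrainingArguments(
    output_dir="colqwen2-lora",        # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=4,     # 4 per GPU x 8 GPUs = 32 overall
    learning_rate=5e-5,
    bf16=True,
)
```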
Guide: Running Locally
- Install Dependencies: Ensure `colpali-engine` (version > 0.3.4) and `transformers` (version > 4.46.1) are installed (a version check is sketched after this list):

  ```bash
  pip install git+https://github.com/illuin-tech/colpali
  ```
- Setup the Model:

  ```python
  import torch
  from PIL import Image
  from colpali_engine.models import ColQwen2, ColQwen2Processor

  # Load the model in bfloat16 on the first CUDA device and switch to eval mode.
  model = ColQwen2.from_pretrained(
      "vidore/colqwen2-v1.0",
      torch_dtype=torch.bfloat16,
      device_map="cuda:0",
  ).eval()
  processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")
  ```
- Process Inputs (see the scoring sketch after this list for how to read the output):

  ```python
  # Toy inputs: two placeholder images and two queries.
  images = [
      Image.new("RGB", (32, 32), color="white"),
      Image.new("RGB", (16, 16), color="black"),
  ]
  queries = [
      "Is attention really all you need?",
      "What is the amount of bananas farmed in Salvador?",
  ]

  # Preprocess and move the batches to the model's device.
  batch_images = processor.process_images(images).to(model.device)
  batch_queries = processor.process_queries(queries).to(model.device)

  # Compute multi-vector embeddings for images and queries.
  with torch.no_grad():
      image_embeddings = model(**batch_images)
      query_embeddings = model(**batch_queries)

  # Late-interaction (MaxSim) scores between every query and every image.
  scores = processor.score_multi_vector(query_embeddings, image_embeddings)
  ```
- Hardware Recommendation: Use a cloud GPU service like AWS, GCP, or Azure to efficiently run the model, especially for larger datasets.
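As referenced in the install step, here is one way to confirm the installed versions, assuming the distribution is published under the name colpali-engine (`importlib.metadata` is standard library):

```python
from importlib.metadata import version

# Print installed versions to check them against the minimums above.
print("colpali-engine:", version("colpali-engine"))
print("transformers:", version("transformers"))
```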
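And the scoring sketch referenced in the Process Inputs step: `score_multi_vector` returns one score per (query, image) pair, so the top-scoring image for each query is its retrieval result. A minimal sketch, assuming the scores come back as a 2-D tensor of shape (num_queries, num_images):

```python
# Higher scores mean higher relevance.
best = scores.argmax(dim=1)
for i, query in enumerate(queries):
    print(f"{query!r} -> image {best[i].item()} "
          f"(score {scores[i, best[i]].item():.2f})")
```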
License
ColQwen2's vision-language backbone (Qwen2-VL) is licensed under Apache 2.0, while the adapters are licensed under MIT.