ColQwen2 v0.1
Introduction
ColQwen2 is a visual retriever model designed to index documents using visual features. It extends the Qwen2-VL-2B model with a ColBERT-style multi-vector representation approach and was introduced in the paper "ColPali: Efficient Document Retrieval with Vision Language Models."
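For intuition, the ColBERT-style approach scores a query against a page by late interaction: every query token embedding is compared with every page patch embedding, and the per-token maxima are summed (MaxSim). Below is a minimal plain-PyTorch sketch of that operation; it is illustrative only and not the colpali-engine implementation, which the library exposes as `processor.score_multi_vector` (shown in the usage guide further down).

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score for one query/page pair.

    query_emb: (num_query_tokens, dim) multi-vector query embedding
    page_emb:  (num_patches, dim) multi-vector page embedding
    Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()    # best patch per query token, summed
```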
Architecture
The model uses dynamic image resolutions: input images are not resized, so their aspect ratios are preserved. It can process up to 768 image patches, trading higher memory usage for improved performance. The model is built with the ColPali engine, which adapts the transformer backbone with low-rank adapters (LoRA) and trains it using a paged AdamW optimizer.
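As a rough illustration of what a 768-patch budget implies for input resolution, the sketch below assumes Qwen2-VL's published patching scheme (14-pixel patches merged 2x2, i.e. one visual token per 28x28 pixel area). The constants and helper names are hypothetical, not part of colpali-engine.

```python
PATCH_EDGE = 28          # pixels per visual token edge after 2x2 merging (Qwen2-VL design)
MAX_VISUAL_TOKENS = 768  # patch budget mentioned above

def visual_token_count(width: int, height: int) -> int:
    """Approximate number of visual tokens for an image at this resolution."""
    return (width // PATCH_EDGE) * (height // PATCH_EDGE)

def fit_to_budget(width: int, height: int) -> tuple[int, int]:
    """Downscale (preserving aspect ratio) until the token budget holds."""
    tokens = visual_token_count(width, height)
    if tokens <= MAX_VISUAL_TOKENS:
        return width, height
    scale = (MAX_VISUAL_TOKENS / tokens) ** 0.5
    # Flooring effects can leave the estimate slightly over budget,
    # so nudge the scale down until the constraint holds.
    while visual_token_count(int(width * scale), int(height * scale)) > MAX_VISUAL_TOKENS:
        scale *= 0.99
    return int(width * scale), int(height * scale)

print(visual_token_count(1092, 1568))  # a tall A4-like page: 2184 tokens, over budget
print(fit_to_budget(1092, 1568))       # downscaled to fit within 768 tokens
```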
Training
Dataset
The training dataset comprises 127,460 query-page pairs: 63% from academic datasets and 37% synthetic data generated from web-crawled PDF documents. The data is predominantly English. A 2% validation split is held out for hyperparameter tuning, and the training set contains no contamination from the evaluation datasets.
Parameters
- Epochs: 1
- Precision: bfloat16
- Adapters: LoRA, alpha=32, r=32
- Optimizer: Paged AdamW 8-bit
- Setup: 8 GPUs, data parallelism
- Learning Rate: 5e-5 with linear decay and 2.5% warmup
- Batch Size: 32
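For concreteness, these hyperparameters map onto the `peft` and `transformers` APIs roughly as in the sketch below. The target modules and the per-device batch size (4 per GPU across 8 GPUs, giving the global batch of 32) are illustrative assumptions, not stated in the card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the list above; target_modules is an illustrative
# guess (attention projections), not confirmed by the model card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Optimizer and schedule settings from the list above. With 8 GPUs under
# data parallelism, a per-device batch of 4 yields the global batch of 32.
training_args = TrainingArguments(
    output_dir="colqwen2-finetune",  # hypothetical path
    num_train_epochs=1,
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,
    per_device_train_batch_size=4,
)
```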
Guide: Running Locally
Ensure `colpali-engine` version > 0.3.1 and `transformers` version > 4.45.0 are installed.
- Installation:

```bash
pip install git+https://github.com/illuin-tech/colpali
```
- Usage:

```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Inputs: dummy images and example queries
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Preprocess the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: one multi-vector embedding per image and per query
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores between every query and every image
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
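`score_multi_vector` returns one late-interaction score per (query, image) pair, so `scores` here is a 2 x 2 tensor. The best-matching page for each query can then be read off with an argmax:

```python
best_pages = scores.argmax(dim=1)  # index of the highest-scoring image for each query
```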
Cloud GPUs: If a suitable local GPU is unavailable, consider cloud GPU services such as AWS, Google Cloud, or Azure for training and inference.
License
The vision language backbone model (Qwen2-VL) is licensed under Apache 2.0, while the adapters attached to the model are licensed under MIT.