ColQwen2 v0.1 (vidore)

Introduction

ColQwen2 is a visual retriever model designed to index documents using visual features. It extends the Qwen2-VL-2B model with a ColBERT-style multi-vector representation approach and was introduced in the paper "ColPali: Efficient Document Retrieval with Vision Language Models."
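
As a rough illustration of the ColBERT-style late-interaction scoring (a minimal sketch for intuition, not the library's implementation; the tensor shapes are assumptions), a query-page score sums, over query tokens, the maximum similarity to any page embedding:

    import torch

    def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
        # query_emb: (num_query_tokens, dim); doc_emb: (num_image_patches, dim)
        sim = query_emb @ doc_emb.T         # token-to-patch similarities
        return sim.max(dim=1).values.sum()  # best-matching patch per token, summed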

Architecture

The model takes dynamic image resolutions as input and does not resize them, preserving their aspect ratio. It processes up to 768 image patches per page, which improves performance at the cost of additional memory. The model is built with the ColPali engine: the backbone is adapted with low-rank adapters (LoRA) and trained with a paged AdamW optimizer.
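
As a back-of-the-envelope illustration of this trade-off (a sketch only; the 28-pixels-per-token figure is an assumption based on Qwen2-VL's merged patch size), the number of visual tokens grows with the page area until the 768-patch cap:

    import math

    def estimate_visual_tokens(width: int, height: int,
                               pixels_per_token: int = 28, max_tokens: int = 768) -> int:
        # Dynamic resolution: aspect ratio is preserved, so the token count
        # scales with the image area up to the cap.
        tokens = math.ceil(width / pixels_per_token) * math.ceil(height / pixels_per_token)
        return min(tokens, max_tokens)

    print(estimate_visual_tokens(1240, 1754))  # an A4 page scanned at 150 dpi hits the cap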

Training

Dataset

The training dataset includes 127,460 query-page pairs from academic datasets (63%) and synthetic data from web-crawled PDF documents (37%). It focuses on English, with a validation set of 2% for hyperparameter tuning, ensuring no data contamination with evaluation datasets.

Parameters

  • Epochs: 1
  • Precision: bfloat16
  • Adapters: LoRA, alpha=32, r=32
  • Optimizer: Paged AdamW 8-bit
  • Setup: 8 GPUs, data parallelism
  • Learning Rate: 5e-5 with linear decay and 2.5% warmup
  • Batch Size: 32
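
A minimal sketch of how these parameters could be expressed with Hugging Face peft and transformers (the target modules and the per-device batch split are assumptions, not the exact release recipe):

    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
        task_type="FEATURE_EXTRACTION",
    )

    training_args = TrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=1,
        bf16=True,                      # bfloat16 precision
        optim="paged_adamw_8bit",       # paged AdamW 8-bit
        learning_rate=5e-5,
        lr_scheduler_type="linear",
        warmup_ratio=0.025,             # 2.5% warmup
        per_device_train_batch_size=4,  # 4 x 8 GPUs = global batch size 32 (assumed split)
    )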

Guide: Running Locally

Make sure colpali-engine is installed from source or with a version newer than 0.3.1, and that the installed transformers version is greater than 4.45.0.

  1. Installation:

    pip install git+https://github.com/illuin-tech/colpali
    
  2. Usage:

    import torch
    from PIL import Image
    from colpali_engine.models import ColQwen2, ColQwen2Processor
    
    # Load the model with its trained adapters, plus the matching processor
    model = ColQwen2.from_pretrained(
        "vidore/colqwen2-v0.1",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",  # or "mps" on Apple Silicon
    ).eval()
    processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")
    
    # Placeholder inputs; replace with real document page images and queries
    images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
    queries = ["Is attention really all you need?", "What is the amount of bananas farmed in Salvador?"]
    
    # Preprocess the inputs and move them to the model's device
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)
    
    # Forward passes to compute the multi-vector embeddings
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)
    
    scores = processor.score_multi_vector(query_embeddings, image_embeddings)
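
    # score_multi_vector performs late-interaction (MaxSim) scoring: `scores`
    # has one row per query and one column per image. As a follow-up sketch
    # (not part of the original snippet), rank pages by score:
    best_page_per_query = scores.argmax(dim=1)  # index of the top image per query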
    

Cloud GPUs: Consider using cloud GPU services like AWS, Google Cloud, or Azure for efficient model training and inference.

License

The vision-language backbone (Qwen2-VL) is licensed under Apache 2.0, while the adapters attached to the model are released under the MIT license.
