ColQwen2 v1.0 (vidore)

Introduction

ColQwen2 is a model designed for efficient document retrieval using Vision Language Models (VLMs). It extends the Qwen2-VL-2B architecture with a ColBERT-style approach, generating multi-vector representations of text and images so that documents can be indexed directly from their visual features.

Architecture

ColQwen2 accepts dynamic image resolutions and does not resize inputs, so their aspect ratios are preserved. The architecture supports up to 768 image patches, balancing image detail against memory use. It employs the ColBERT late-interaction strategy for multi-vector retrieval, which improves performance on visually rich, PDF-type documents.
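
To make the ColBERT-style scoring concrete, below is a minimal sketch of the late-interaction (MaxSim) computation between one query and one document, assuming both are matrices of L2-normalized embeddings; maxsim_score is a hypothetical helper written for illustration, not part of colpali-engine.

    import torch

    def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
        # q_emb: (num_query_tokens, dim) query token embeddings.
        # d_emb: (num_doc_patches, dim) document patch embeddings.
        # Both are assumed L2-normalized, so dot products are cosine similarities.
        sim = q_emb @ d_emb.T                # (num_query_tokens, num_doc_patches)
        return sim.max(dim=1).values.sum()   # max over patches, summed over query tokens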

Training

The model is trained on 127,460 query-page pairs, consisting of 63% academic datasets and 37% synthetic, web-crawled PDF pages paired with VLM-generated pseudo-questions. The training data is entirely in English, which makes it possible to study zero-shot generalization to other languages. The model is trained for one epoch in bfloat16 with LoRA adapters on an 8-GPU setup using data parallelism, a learning rate of 5e-5, and a batch size of 32.
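
For illustration, a peft-based version of this setup might look like the sketch below; the base checkpoint name, LoRA rank, alpha, dropout, and target modules are assumptions made for the example, not confirmed training values.

    import torch
    from peft import LoraConfig, get_peft_model
    from colpali_engine.models import ColQwen2

    # Hypothetical adapter configuration: rank, alpha, dropout, and target
    # modules are assumptions, not the model's published training recipe.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    # Load the backbone in bfloat16 and wrap it with trainable LoRA adapters.
    model = ColQwen2.from_pretrained("vidore/colqwen2-base", torch_dtype=torch.bfloat16)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only adapter weights are trainable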

Guide: Running Locally

  1. Install Dependencies: Ensure colpali-engine (version > 0.3.4) and transformers (version > 4.46.1) are installed.
    pip install git+https://github.com/illuin-tech/colpali
    
  2. Set Up the Model:
    import torch
    from PIL import Image
    from colpali_engine.models import ColQwen2, ColQwen2Processor
    
    # Load the model in bfloat16 on a single GPU and switch to inference mode.
    model = ColQwen2.from_pretrained(
        "vidore/colqwen2-v1.0",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",  # or "cpu" if no GPU is available
    ).eval()
    processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")
    
  3. Process Inputs:
    # Two toy images stand in for real document pages.
    images = [Image.new("RGB", (32, 32), color="white"), Image.new("RGB", (16, 16), color="black")]
    queries = ["Is attention really all you need?", "What is the amount of bananas farmed in Salvador?"]
    
    # Preprocess the inputs and move them to the model's device.
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)
    
    # Forward passes produce multi-vector embeddings for pages and queries.
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)
    
    # Late-interaction (MaxSim) scores, one per (query, image) pair;
    # see the snippet after this list for ranking with these scores.
    scores = processor.score_multi_vector(query_embeddings, image_embeddings)
    
  4. Hardware Recommendation: Run the model on a GPU, for example through a cloud provider such as AWS, GCP, or Azure, especially when indexing larger datasets.
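
As a follow-up to step 3, scores is a (num_queries, num_images) tensor, so ranking pages per query is a single argmax; this minimal sketch assumes the variables defined in the guide above.

    best = scores.argmax(dim=1)  # index of the highest-scoring image per query
    for query, idx in zip(queries, best.tolist()):
        print(f"{query!r} -> images[{idx}]")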

License

ColQwen2's vision-language backbone (Qwen2-VL) is licensed under Apache 2.0, while the adapters are licensed under MIT.
