MonoQwen2-VL-v0.1

by lightonai

Introduction

The MonoQwen2-VL-v0.1 model is a multimodal reranker that scores the relevance of an image to a query. It is fine-tuned from the Qwen2-VL-2B model with LoRA, using the MonoT5 objective of generating "True"/"False" relevance labels.

Architecture

MonoQwen2-VL-v0.1 assesses the pointwise relevance of image-query pairs: it generates "True" if the image is relevant to the query and "False" otherwise. A relevance score is obtained by comparing the logits of these two tokens, which lets the model rerank the candidates returned by a first-stage retriever.
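
Concretely, the score can be read directly off the next-token logits. A minimal sketch, assuming logits is the model's next-token logit vector and true_id/false_id are the token ids of "True" and "False":

    import torch

    # Assumed inputs: `logits` is the [vocab_size] next-token logit vector,
    # true_id/false_id are the token ids of "True" and "False".
    pair = torch.stack([logits[true_id], logits[false_id]])
    score = torch.softmax(pair, dim=-1)[0].item()  # P("True" | image, query)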

Training

The model was trained on the ColPali dataset, with negatives mined using DSE. It is evaluated on the ViDoRe benchmark, where reranking first-stage candidates with MonoQwen2-VL-v0.1 improves NDCG@5 across the benchmark's datasets.

Guide: Running Locally

  1. Install Dependencies:

    • Install torch, transformers, and peft via pip if not already installed, for example:
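
    The package names below are the standard PyPI ones; a recent transformers release with Qwen2-VL support is assumed:

        pip install torch transformers peft
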
  2. Load the Model and Processor:

    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    # The processor comes from the base model; the reranker repo is a LoRA
    # adapter that transformers loads onto Qwen2-VL-2B automatically when
    # peft is installed.
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
    model = Qwen2VLForConditionalGeneration.from_pretrained("lightonai/MonoQwen2-VL-v0.1", device_map="auto")

  3. Prepare Input and Inference:

    • Define a query and load an image.
    • Construct the relevance prompt, apply the chat template, and prepare the multimodal inputs.
    • Run the model to obtain the "True"/"False" logits and compute a relevance score, as sketched below.
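
    A minimal end-to-end sketch, assuming the processor and model from step 2; the query, the image path, and the exact prompt wording are illustrative placeholders (check the model card for the prompt used at training time):

    import torch
    from PIL import Image

    query = "What does this chart show?"  # illustrative query
    image = Image.open("image.png")       # illustrative image path

    # Pointwise relevance prompt; the exact wording here is an assumption.
    prompt = (
        "Assert the relevance of the previous image document to the following "
        f"query, answer True or False. The query is: {query}"
    )
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": prompt}],
    }]

    # Apply the chat template and build the multimodal inputs.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

    # Compare the "True"/"False" logits at the next-token position.
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    score = torch.softmax(logits[[true_id, false_id]], dim=-1)[0].item()
    print(f"Relevance score (probability of 'True'): {score:.4f}")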
  4. Cloud GPU Suggestions:

    • Consider cloud services such as AWS EC2, Google Cloud Platform, or Azure for access to suitable GPUs.

License

The MonoQwen2-VL-v0.1 model is licensed under the Apache 2.0 license.