MonoQwen2-VL-v0.1
by lightonai
Introduction
The MonoQwen2-VL-v0.1 model is a multimodal reranker optimized for determining the relevance of an image to a query. It is fine-tuned from the Qwen2-VL-2B model using LoRA and trained with the MonoT5 objective.
Architecture
MonoQwen2-VL-v0.1 is designed to assess the pointwise relevance of image-query pairs: it outputs "True" if the image is relevant to the query and "False" otherwise. The model derives a relevance score by comparing the logits of these two tokens, enabling effective reranking of candidates returned by a first-stage retriever.
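Concretely, the score can be read off as a softmax over the two token logits. A minimal sketch of this scoring scheme (the function name and example values are illustrative, not part of the model's implementation):
```python
import torch

def relevance_score(true_logit: float, false_logit: float) -> float:
    """Pointwise MonoT5-style score: probability mass on "True" versus "False"."""
    return torch.softmax(torch.tensor([true_logit, false_logit]), dim=0)[0].item()

# An image-query pair whose "True" logit dominates scores close to 1.
print(relevance_score(4.2, -1.3))  # ~0.996
```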
Training
The model was trained on the ColPali dataset, with negatives mined using DSE. It is evaluated on the ViDoRe Benchmark, where reranking first-stage candidates with MonoQwen2-VL-v0.1 improves NDCG@5 across a range of datasets.
Guide: Running Locally
- Install Dependencies: install torch, transformers, and peft via pip if not already installed (e.g., pip install torch transformers peft).
- Load the Model and Processor:
```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# The processor comes from the base model; the reranker weights are LoRA-tuned on top of it.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)
```
- Prepare Input and Inference:
  - Define a query and load an image.
  - Construct the prompt, apply the chat template, and prepare the input.
  - Use the model to obtain logits and calculate the relevance score (see the sketch after this list).
- Cloud GPU Suggestions: consider cloud providers such as AWS EC2, Google Cloud Platform, or Azure for access to suitable GPU resources.
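A minimal end-to-end sketch of the inference step, reusing the model and processor loaded above. The example query, image path, and exact prompt wording are assumptions for illustration; check the official model card for the prompt the reranker was trained with:
```python
import torch
from PIL import Image

# Hypothetical inputs: replace with your own query and image.
query = "What does the chart on this page show?"
image = Image.open("page.png")

# Assumed prompt wording; verify against the model card.
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    f"answer True or False. The query is: {query}"
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template and prepare the model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# One forward pass: the score comes from the logits at the first generated position.
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]

true_id = processor.tokenizer.convert_tokens_to_ids("True")
false_id = processor.tokenizer.convert_tokens_to_ids("False")
score = torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()
print(f"Relevance score (probability of 'True'): {score:.3f}")
```
Scoring each candidate image from the first-stage retriever this way and sorting by the resulting probabilities yields the reranked list.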
License
The MonoQwen2-VL-v0.1 model is licensed under the Apache 2.0 license.