VLM2Vec-Full (TIGER-Lab)
Introduction
VLM2Vec is a framework for training vision-language models to handle massive multimodal embedding tasks, as defined by the Massive Multimodal Embedding Benchmark (MMEB). It builds upon an existing vision-language model (VLM), converting Phi-3.5-V into a versatile embedding model, with the goal of a single unified model that can serve a wide range of tasks through multimodal embeddings.
Architecture
The VLM2Vec model is built on the Phi-3.5-vision-instruct backbone and is implemented with PyTorch and the Transformers library. The architecture produces multimodal embeddings: combined visual and textual inputs are encoded into a single vector representation that downstream tasks can consume.
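To make the embedding idea concrete, the sketch below shows one way an image-plus-instruction input could be turned into a single vector with the Phi-3.5-vision-instruct backbone, assuming last-token pooling of the final hidden layer. The prompt format, pooling choice, and loading flags here are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch (not the official VLM2Vec code): pool the last token of the
# final hidden layer of a Phi-3.5-V style backbone into a single embedding.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"  # backbone; swap in a VLM2Vec checkpoint as needed
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",  # use "flash_attention_2" if flash-attn is installed
).to("cuda").eval()

image = Image.open("example.jpg")                                  # any local image
prompt = "<|image_1|>\nRepresent the given image for retrieval."   # assumed instruction format

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)

# Last-token pooling: take the final layer's hidden state at the last position.
embedding = outputs.hidden_states[-1][:, -1, :]        # shape: (1, hidden_dim)
embedding = F.normalize(embedding, dim=-1)             # L2-normalize for cosine similarity
```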
Training
VLM2Vec is trained on the MMEB-train dataset and evaluated on MMEB-eval, using contrastive learning with in-batch negatives. Two training configurations exist: LoRA (Low-Rank Adaptation) with a batch size of 1024, and full fine-tuning with a batch size of 2048. Performance is optimized and reported across the 36 evaluation datasets in MMEB.
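As an illustration of the contrastive objective described above, here is a minimal InfoNCE-style loss over in-batch negatives. The temperature value, tensor shapes, and function name are assumptions; consult the repository's training code for the exact implementation.

```python
# Illustrative sketch of contrastive training with in-batch negatives (InfoNCE).
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, target_emb: torch.Tensor, temperature: float = 0.02):
    """query_emb, target_emb: (batch_size, dim) embeddings of matched query/target pairs."""
    query_emb = F.normalize(query_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    # Every other target in the batch serves as a negative for a given query.
    logits = query_emb @ target_emb.T / temperature        # (batch_size, batch_size)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because each query is contrasted against every other target in the same batch, larger batches (here 1024 or 2048) provide more negatives per update, which is why batch size matters for this objective.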
Guide: Running Locally
- Clone the Repository:
  git clone https://github.com/TIGER-AI-Lab/VLM2Vec.git
- Install Requirements:
  Navigate to the cloned directory and install the necessary packages:
  pip install -r requirements.txt
- Set Up the Model:
  Use the code snippets provided in the README to load, process, and evaluate the model. Ensure that you have a suitable GPU setup, such as an NVIDIA CUDA-capable card, to run the model efficiently.
- Execution:
  Use the provided Python code to process images and text and compute similarities. Make sure to load the model onto a GPU, for example model = model.to('cuda', dtype=torch.bfloat16). A hedged end-to-end sketch follows this list.
- Cloud GPUs:
  Consider using cloud services like AWS, Google Cloud, or Azure for access to powerful GPUs if local resources are limited.
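For the execution step, the following hedged end-to-end sketch embeds an image query and a few candidate captions, then ranks the candidates by cosine similarity. It reuses the assumed last-token pooling from the Architecture section; the model ID, prompts, and the embed helper are hypothetical, so prefer the scripts shipped in the repository for real evaluations.

```python
# Hedged sketch: rank candidate captions against an image query by cosine similarity.
from typing import Optional

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"  # backbone; swap in a VLM2Vec checkpoint as needed
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",  # use "flash_attention_2" if flash-attn is installed
).to("cuda").eval()

@torch.no_grad()
def embed(text: str, image: Optional[Image.Image] = None) -> torch.Tensor:
    """Hypothetical helper: encode text (optionally with an image) into one L2-normalized vector."""
    images = [image] if image is not None else None
    inputs = processor(text, images, return_tensors="pt").to("cuda")
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    emb = out.hidden_states[-1][:, -1, :]               # last-token pooling (assumed)
    return F.normalize(emb, dim=-1)

query = embed("<|image_1|>\nFind a caption that describes this image.",
              Image.open("example.jpg"))
candidates = ["A dog running on a beach.",
              "A plate of pasta on a table.",
              "A city skyline at night."]
cand_embs = torch.cat([embed(c) for c in candidates])   # (num_candidates, hidden_dim)

scores = (query @ cand_embs.T).squeeze(0)               # cosine similarities
print(candidates[int(scores.argmax())])                 # best-matching caption
```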
License
The VLM2Vec project is released under the Apache 2.0 License. This allows for open-source use, modification, and distribution, provided that the original license terms are adhered to.