gme-Qwen2-VL-2B-Instruct
Introduction
GME-Qwen2-VL-2B is part of the GME-Qwen2-VL series of unified multimodal embedding models developed by Tongyi Lab, Alibaba Group. These models accept text, images, and image-text pairs as input and produce universal vector representations for diverse retrieval tasks. They achieve strong results on the Universal Multimodal Retrieval Benchmark (UMRB) and the Massive Text Embedding Benchmark (MTEB).
Architecture
The GME models feature:
- Unified Multimodal Representation: Handles both single-modal and combined image-text inputs in one embedding space, supporting versatile retrieval scenarios (see the fused-input sketch after this list).
- High Performance: Achieves state-of-the-art results on retrieval benchmarks.
- Dynamic Image Resolution: Accepts images at variable resolutions rather than requiring a fixed input size.
- Strong Visual Retrieval Performance: Excels in complex document understanding tasks.
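
To make the combined-modal input concrete, here is a minimal sketch that embeds an image together with a caption as a single fused input. It assumes the gme_inference.py helper from the model repository (used in the guide below) also exposes a get_fused_embeddings method alongside its text and image methods; the caption and URL are placeholders.

```python
from gme_inference import GmeQwen2VL  # helper script shipped with the model repository

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

# Placeholder inputs for illustration only.
captions = ["A battery electric pickup truck parked outdoors."]
images = ["https://example.com/image1.jpg"]

# Assumption: the helper exposes a fused (image + text) embedding method that
# produces vectors in the same space as text-only and image-only embeddings.
e_fused = gme.get_fused_embeddings(texts=captions, images=images)
print(e_fused.shape)  # (1, embedding_dim)
```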
Training
The model is trained on English-only data, although the underlying Qwen2-VL backbone is multilingual. For efficiency, the visual token count is limited to 1024 during training.
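
For context on that budget: in Qwen2-VL, each visual token corresponds roughly to a 28x28-pixel patch after merging, so an image of about 1024 x 28 x 28 pixels stays within a 1024-token limit. The sketch below caps the processor's resolution accordingly; min_pixels / max_pixels follow the standard Qwen2-VL processor interface, and applying the same cap when using gme_inference.py is an assumption, not documented behavior.

```python
from transformers import AutoProcessor

# Assumed resolution cap so the visual token count stays near the 1024-token
# training budget (one visual token ~ one 28x28-pixel patch in Qwen2-VL).
max_pixels = 1024 * 28 * 28
min_pixels = 256 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```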
Guide: Running Locally
- Installation Requirements:
  - Ensure Python and the relevant libraries, such as PyTorch and Transformers, are installed.
  - Download the Alibaba-NLP/gme-Qwen2-VL-2B-Instruct model and its gme_inference.py helper script from Hugging Face.
- Usage Example (a retrieval-style extension follows this list):

  ```python
  from gme_inference import GmeQwen2VL

  texts = [
      "What kind of car is this?",
      "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
  ]
  images = [
      "https://example.com/image1.jpg",
      "https://example.com/image2.jpg",
  ]

  gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

  # Embed texts and images into the same vector space.
  e_text = gme.get_text_embeddings(texts=texts)
  e_image = gme.get_image_embeddings(images=images)

  # Element-wise similarity between each text and the image at the same index.
  print((e_text * e_image).sum(-1))
  ```
- Cloud GPUs:
  - Consider using cloud GPU services such as AWS, Google Cloud, or Alibaba Cloud for optimal performance.
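
Building on the usage example above, here is a retrieval-style sketch: treat the texts as queries and the images as candidates, build a full similarity matrix, and pick the best match per query. The normalization step is defensive, on the assumption that the helper may return unnormalized vectors; if they are already unit-length it is a no-op.

```python
import torch.nn.functional as F

# e_text and e_image come from the usage example above.
q = F.normalize(e_text, dim=-1)   # (num_texts, dim)
d = F.normalize(e_image, dim=-1)  # (num_images, dim)

scores = q @ d.T                  # (num_texts, num_images) cosine similarities
best = scores.argmax(dim=-1)      # index of the best-matching image per text

print(scores)
print(best)
```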
License
The model is licensed under the Apache 2.0 License. Redistribution and use require prominent attribution, and any derivative AI models must prefix their name with "GME".