Qwen2 V L 7 B Instruct
QwenIntroduction
Qwen2-VL is an advanced vision-language model, succeeding Qwen-VL, with cutting-edge capabilities in image and video understanding, multilingual text comprehension, and integration with mobile and robotic devices. It excels in visual understanding benchmarks and supports numerous languages beyond English and Chinese.
Architecture
Qwen2-VL features several architectural enhancements:
- Naive Dynamic Resolution: Supports arbitrary image resolutions and dynamically maps them into visual tokens for flexible processing.
- Multimodal Rotary Position Embedding (M-ROPE): Captures 1D textual, 2D visual, and 3D video positional information, boosting multimodal processing capabilities.
- Available in configurations with 2, 7, and 72 billion parameters, the 7B instruction-tuned model is highlighted.
Training
Qwen2-VL achieves state-of-the-art results on multiple benchmarks for visual and video understanding, including MathVista, DocVQA, and others. The model supports multiple languages and can be integrated into devices for automatic operation based on visual and text instructions.
Guide: Running Locally
To run Qwen2-VL locally:
-
Install Dependencies:
pip install git+https://github.com/huggingface/transformers pip install qwen-vl-utils
-
Load the Model:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor model = Qwen2VLForConditionalGeneration.from_pretrained( "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto" ) processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
-
Prepare Inputs for Inference: Configure input images or videos and use the processor to prepare the data.
-
Run Inference:
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda") generated_ids = model.generate(**inputs, max_new_tokens=128) output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
For optimal performance, consider using cloud GPUs like those offered by AWS, Google Cloud, or Azure.
License
Qwen2-VL is released under the Apache License 2.0, allowing for broad usage and distribution.