Qwen2-VL-2B-Instruct
Introduction
Qwen2-VL is the latest version of the Qwen-VL model, showcasing nearly a year of advancements. It offers state-of-the-art capabilities in visual understanding across various benchmarks, long video comprehension, multilingual support, and integration with devices for automatic operations. Enhancements include handling arbitrary image resolutions and improved positional embedding for multimodal data.
Architecture
Qwen2-VL features Naive Dynamic Resolution, allowing it to manage images of any resolution by mapping them into a dynamic number of visual tokens. It also utilizes Multimodal Rotary Position Embedding (M-ROPE) to effectively process 1D textual, 2D visual, and 3D video positional information. The model is available in configurations with 2, 7, and 72 billion parameters, with the repository containing the instruction-tuned 2B model.
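As a rough illustration of Naive Dynamic Resolution, the number of visual tokens scales with the image size: the image is split into fixed-size patches, and small groups of patches are merged into single tokens. The sketch below estimates the token count under assumed values (a 28x28 patch size and a 2x2 merge factor, which are not stated in this document).

```python
import math

PATCH = 28  # assumed ViT patch size
MERGE = 2   # assumed spatial merge factor (2x2 patches -> 1 visual token)

def visual_token_count(height: int, width: int) -> int:
    """Illustrative estimate of visual tokens for one image under
    dynamic resolution: count 28x28 patches, then merge 2x2 groups."""
    patches = math.ceil(height / PATCH) * math.ceil(width / PATCH)
    return patches // (MERGE * MERGE)

print(visual_token_count(1120, 1120))  # 40x40 patches -> 400 tokens
print(visual_token_count(224, 224))    # 8x8 patches -> 16 tokens
```

Because the token count is dynamic, higher-resolution inputs consume more of the context window, which is why resolution limits matter for memory and latency.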
Training
The model was evaluated against several benchmarks, demonstrating strong performance on image and video tasks. Using it requires installing the latest Hugging Face transformers library (from source) and the qwen-vl-utils package, which handles various visual input formats.
Guide: Running Locally
Basic Steps
- Install the required libraries:

  ```bash
  pip install git+https://github.com/huggingface/transformers
  pip install qwen-vl-utils
  ```
- Load the model and processor:

  ```python
  from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

  model = Qwen2VLForConditionalGeneration.from_pretrained(
      "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
  )
  processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
  ```
- Prepare inputs for inference:
  - Use `qwen_vl_utils` to process the vision information.
  - Set up the chat messages and prepare model inputs using the `AutoProcessor`.
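The message structure expected by the processor can be sketched as follows. The image path below is a placeholder, not from the original guide; the processor and model are assumed to have been loaded in the previous step, so those calls are shown only as comments.

```python
# Sketch of the chat message format; the image value is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With a loaded `processor` and `model`, inputs would then be prepared
# roughly like this (not executed here):
#
#   from qwen_vl_utils import process_vision_info
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
#   image_inputs, video_inputs = process_vision_info(messages)
#   inputs = processor(
#       text=[text], images=image_inputs, videos=video_inputs,
#       padding=True, return_tensors="pt",
#   ).to(model.device)
```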
- Run inference:

  ```python
  generated_ids = model.generate(**inputs, max_new_tokens=128)
  # Decode only the newly generated tokens, not the echoed prompt
  generated_ids_trimmed = [
      out_ids[len(in_ids):]
      for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
  output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
  print(output_text)
  ```
Cloud GPUs
For optimal performance, especially in scenarios involving multiple images or videos, it is recommended to use cloud GPUs such as AWS EC2 instances with GPU support, Google Cloud GPU offerings, or Azure GPU virtual machines.
License
Qwen2-VL is available under the Apache-2.0 license, allowing for wide use and modification with proper attribution.