Qwen2-VL-2B
Introduction
Qwen2-VL-2B is the latest iteration of the Qwen-VL series, representing nearly a year of advancements. It is a base pretrained model, released without instruction tuning, that excels at multimodal understanding.
Architecture
Qwen2-VL-2B introduces several architectural improvements:
- Naive Dynamic Resolution: The model can handle images of arbitrary resolution, mapping each image to a dynamic number of visual tokens for more faithful visual processing (see the processor sketch after this list).
- Multimodal Rotary Position Embedding (M-ROPE): The positional embedding is decomposed into parts that capture 1D textual, 2D visual, and 3D video positional information, strengthening multimodal processing.
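Dynamic resolution can be steered at preprocessing time by bounding the per-image visual-token budget. Below is a minimal sketch assuming the Transformers AutoProcessor and the Qwen/Qwen2-VL-2B repository ID; the pixel budgets are illustrative values, not tuned recommendations.

```python
from transformers import AutoProcessor

# One visual token covers roughly a 28x28 pixel area after patch merging,
# so pixel budgets are conveniently written as multiples of 28 * 28.
# (These particular bounds are example values, not recommendations.)
min_pixels = 256 * 28 * 28    # lower bound on visual tokens per image
max_pixels = 1280 * 28 * 28   # upper bound on visual tokens per image

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

With these bounds, oversized images are scaled down until they fit the budget and small images are scaled up to the lower bound, so memory use stays predictable regardless of input resolution.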
Training
The Qwen2-VL model family includes models with 2, 7, and 72 billion parameters; this repository contains the pretrained 2B version. The model depends on recently added code in the Hugging Face Transformers library, so Transformers should be upgraded to a current release before loading the model, or the architecture will not be recognized.
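As a quick sanity check before loading, the installed Transformers version can be verified programmatically. The 4.45.0 threshold below is an assumption about when Qwen2-VL support first shipped; when in doubt, simply upgrade as shown in the guide that follows.

```python
# Hedged sketch: check that Transformers is recent enough for Qwen2-VL.
# The 4.45.0 threshold is an assumption, not an official requirement.
from packaging import version  # shipped as a Transformers dependency
import transformers

if version.parse(transformers.__version__) < version.parse("4.45.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} likely predates Qwen2-VL "
        "support; upgrade with `pip install -U transformers`."
    )
```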
Guide: Running Locally
- Environment Setup: Install the latest version of the Transformers library:
  pip install -U transformers
- Model Download: Access the model through the Hugging Face Hub or its repository links (loading is shown in the sketch after this list).
- Cloud GPUs: For optimal performance, consider using cloud GPU services such as AWS, GCP, or Azure.
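Putting these steps together, here is a minimal inference sketch. It assumes the repository ID Qwen/Qwen2-VL-2B, a hypothetical local image file demo.jpg, a GPU with bfloat16 support, and the accelerate package for device placement; because this is a base model without instruction tuning, the prompt is a plain continuation rather than a chat conversation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the pretrained base model; device_map="auto" assumes `accelerate`
# is installed, and bfloat16 assumes a reasonably recent GPU.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B")

# Base-model prompting: wrap the image placeholder in vision tokens and
# let the model continue the text (no chat template is applied).
image = Image.open("demo.jpg")  # hypothetical local image path
text = "<|vision_start|><|image_pad|><|vision_end|>This picture shows"

inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens before decoding so only the continuation prints.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```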
License
The Qwen2-VL-2B model is released under the Apache 2.0 License, permitting use, distribution, and modification under the terms of this license.