Qwen2-VL-72B-Instruct
Introduction
Qwen2-VL is an advanced version of the Qwen-VL model, showcasing significant enhancements in multimodal capabilities such as image, video, and multilingual understanding. It supports complex reasoning and decision-making processes, making it suitable for integration with devices like mobile phones and robots.
Architecture
The model incorporates several architectural improvements:
- Naive Dynamic Resolution: Handles images of arbitrary resolution by mapping them into a dynamic number of visual tokens (see the processor sketch after this list).
- Multimodal Rotary Position Embedding (M-ROPE): Captures positional information across 1D textual, 2D visual, and 3D video inputs, enhancing multimodal processing.
- The repository contains a 72 billion parameter instruction-tuned model, with additional models available at different parameter scales.
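As a concrete illustration of dynamic resolution, the Hugging Face processor for this model accepts min_pixels and max_pixels bounds that cap how many visual tokens each image is mapped to. The sketch below uses illustrative bounds, not required values; each visual token corresponds to roughly a 28x28 pixel patch.

```python
from transformers import AutoProcessor

# Each visual token covers roughly a 28x28 pixel patch, so these bounds
# translate into a per-image budget of about 256-1280 visual tokens.
# The exact values are illustrative; tune them to your memory/quality trade-off.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```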
Performance
Qwen2-VL achieves state-of-the-art (SoTA) performance on benchmarks such as MathVista and DocVQA, and supports multiple languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese. It can interpret videos over 20 minutes long, enabling video-based question answering and content creation.
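For video-based question answering, video inputs use the same chat-message format as images (see the inference sketch further below). A minimal sketch of such a message, assuming a placeholder local video path and illustrative frame-sampling options consumed by the qwen-vl-utils helper:

```python
# A chat-style message describing a video QA request. The video path and the
# fps/max_pixels values are placeholders for illustration; qwen-vl-utils
# samples frames from the file according to these options.
video_message = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",  # placeholder path
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]
```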
Guide: Running Locally
Basic Steps
- Installation: Install the required libraries:

```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```
- Model Loading:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")
```
- Inference: Prepare chat-style messages containing images or videos, process them with the processor, and generate with the model; see the sketch after this list.
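A minimal end-to-end sketch for the single-image case, assuming the model and processor loaded above, a CUDA device, and a placeholder image path (the qwen-vl-utils helper also accepts URLs and base64-encoded images):

```python
from qwen_vl_utils import process_vision_info

# Chat-style message combining one image with a text prompt.
# The image path is a placeholder; replace it with your own file or URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and extract the vision inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```

The same flow handles multi-image and video conversations; only the content entries in messages change.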
Cloud GPUs
For optimal performance, especially in video and multi-image scenarios, consider running on cloud GPU instances from providers such as AWS, Google Cloud, or Azure.
License
The model is released under the "tongyi-qianwen" license. For more details, refer to the license included in the model repository.