Qwen2-VL-7B-Instruct-GPTQ-Int4
Introduction
Qwen2-VL-7B-Instruct-GPTQ-Int4 is the GPTQ Int4-quantized release of the instruction-tuned 7B model in Qwen2-VL, the latest iteration of the Qwen-VL series, showcasing advances in visual understanding and multimodal processing. It offers state-of-the-art capabilities in processing and understanding images, videos, and multilingual text, making it suitable for a wide range of applications, including mobile and robotic integration.
Architecture
Key architectural updates include:
- Naive Dynamic Resolution: Handles arbitrary image resolutions by mapping them into a dynamic number of visual tokens (see the tuning sketch below).
- Multimodal Rotary Position Embedding (M-ROPE): Enhances processing capabilities by capturing positional information across textual, visual, and video dimensions.
The repository includes the instruction-tuned 7B parameter model, with additional configurations available for different use cases.
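As a rough illustration of how Naive Dynamic Resolution is exposed in practice, the processor accepts min_pixels and max_pixels arguments that bound the per-image visual token budget; the values below are illustrative, not recommended settings.

    from transformers import AutoProcessor

    # Each visual token corresponds to a 28x28 pixel patch, so these pixel bounds
    # translate directly into a per-image visual-token budget (illustrative values).
    min_pixels = 256 * 28 * 28    # at least ~256 visual tokens per image
    max_pixels = 1280 * 28 * 28   # at most ~1280 visual tokens per image

    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
        min_pixels=min_pixels,
        max_pixels=max_pixels,
    )

Tighter bounds reduce memory use and latency on large images, while looser bounds preserve more visual detail.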
Training
The Qwen2-VL series reports results for its quantized models on benchmarks such as MMMU_VAL, DocVQA_VAL, MMBench_DEV_EN, and MathVista_MINI. Speed benchmarks on GPUs such as the NVIDIA A100 show efficient inference for the BF16, GPTQ-Int8, GPTQ-Int4, and AWQ variants.
Guide: Running Locally
- Install Dependencies:
  - Ensure the latest version of Hugging Face transformers is installed:
    pip install git+https://github.com/huggingface/transformers
  - Install qwen-vl-utils for handling various visual inputs:
    pip install qwen-vl-utils
- Set Up the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
        torch_dtype="auto",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")
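  If the flash-attn package is installed, the model can optionally be loaded with FlashAttention-2 for faster inference and lower memory use on longer inputs. This is only a sketch using the standard attn_implementation argument of from_pretrained, not a required step:

    import torch
    from transformers import Qwen2VLForConditionalGeneration

    # Optional variant: assumes flash-attn is installed and a CUDA GPU is available.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )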
- Prepare Inputs and Run Inference:
  - Process messages containing text and images/videos using qwen_vl_utils or manually (a full sketch follows this list).
  - Send the processed inputs to the model and decode the generated outputs.
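  As a concrete sketch of these two sub-steps (the image URL and prompt below are placeholders), a full inference round following the standard Qwen2-VL usage pattern looks roughly like this, reusing the model and processor from the previous step:

    from qwen_vl_utils import process_vision_info

    # A chat message combining an image and a text prompt (placeholder values).
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://example.com/demo.jpeg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # Build the chat-formatted prompt and collect the visual inputs.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")  # assumes a CUDA GPU

    # Generate, then decode only the newly produced tokens.
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])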
For optimal performance, particularly with larger inputs, consider using cloud GPUs like the NVIDIA A100.
License
The model is licensed under the Apache 2.0 License, allowing for broad use and distribution.