QVQ-72B-Preview
Introduction
QVQ-72B-Preview is an experimental research model developed by the Qwen team to enhance visual reasoning capabilities.
Architecture
The model is based on the Qwen2-VL-72B architecture and runs as an image-text-to-text model in the Hugging Face Transformers library. It demonstrates strong performance in multidisciplinary understanding and reasoning.
Training
QVQ-72B-Preview is designed to improve visual reasoning and handles varied input types, such as base64-encoded images, image URLs, and images interleaved with text. However, it currently supports only single-round dialogues and image outputs, and it does not accept video inputs.
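As a brief illustration, images can be referenced in the chat-message format by URL, local file path, or base64 data URI. This is a minimal sketch following the Qwen2-VL message conventions; the URL, path, base64 payload, and question below are placeholders, not content from the model card:

```python
# Hypothetical single-round message showing the accepted image reference styles;
# the URL, file path, and base64 payload are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/figure.png"},  # remote URL
            {"type": "image", "image": "file:///tmp/diagram.jpg"},         # local file
            {"type": "image", "image": "data:image;base64,/9j/..."},       # base64 data
            {"type": "text", "text": "Compare these figures step by step."},
        ],
    }
]
```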
Model Stats
The model achieves a score of 70.3% on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, and its results on the MathVision and OlympiadBench benchmarks show marked gains in mathematical reasoning and problem solving.
Guide: Running Locally
- Install Dependencies:

  ```bash
  pip install qwen-vl-utils
  ```
- Use the Model:

  ```python
  from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
  from qwen_vl_utils import process_vision_info

  # Load the checkpoint with automatic dtype selection and device placement
  model = Qwen2VLForConditionalGeneration.from_pretrained(
      "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
  )
  processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
  ```
- Process Inputs: Prepare text and images using `processor.apply_chat_template` and `process_vision_info`, as sketched after this list.
- Run Inference: Use `model.generate` to obtain outputs and decode them with `processor.batch_decode` (see the sketch below).
- Hardware Recommendations: Consider using cloud GPUs like AWS EC2 P3 instances or Google Cloud's NVIDIA A100 for optimal performance.
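A minimal sketch of the input-processing step, continuing from the loading code above; the image URL and question are placeholders:

```python
# Placeholder single-round conversation: one image plus a question about it
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sample.png"},
            {"type": "text", "text": "Describe the reasoning shown in this figure."},
        ],
    }
]

# Render the chat template and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

# Pack text and images into model-ready tensors on the model's device
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
```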
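And the generation step, decoding only the newly produced tokens; `max_new_tokens=512` is an arbitrary cap chosen for this sketch:

```python
# Generate a response from the prepared inputs
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the model's answer is decoded
trimmed_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    trimmed_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```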
License
The model is released under the Qwen license. For more details, refer to the license document.