QVQ-72B-Preview 4.65bpw h6 exl2
wolfram
Introduction
QVQ-72B-Preview is an experimental research model by the Qwen team designed to enhance visual reasoning capabilities. It achieves strong performance across a range of benchmarks, demonstrating multidisciplinary understanding and reasoning abilities.
Architecture
This repository provides a 4.65bpw h6 EXL2 quantization of the Qwen/QVQ-72B-Preview model, supporting a 32K context with Q4 cache on systems with 48 GB of VRAM. The model is built on the Qwen2-VL-72B base model and uses the transformers library.
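Because the weights here are an EXL2 quantization, they are typically served with the exllamav2 backend rather than plain transformers. The snippet below is a minimal text-only sketch of such a setup, assuming a recent exllamav2 release with the dynamic generator; the local weight path is a placeholder and vision input handling is omitted.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path to the downloaded 4.65bpw h6 EXL2 weights
config = ExLlamaV2Config("/path/to/QVQ-72B-Preview-4.65bpw-h6-exl2")
config.max_seq_len = 32768                       # 32K context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, max_seq_len=32768, lazy=True)  # Q4 KV cache
model.load_autosplit(cache)                      # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

print(generator.generate(prompt="Explain step-by-step reasoning in one sentence.",
                         max_new_tokens=128))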
Training
The model has been trained to excel at visual reasoning tasks and has shown impressive results on benchmarks such as the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark and MathVision. However, it has known limitations, including language mixing, recursive reasoning loops, and difficulty maintaining focus during multi-step visual reasoning tasks. Safety and ethical considerations are crucial when deploying this model.
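Given the recursive-reasoning-loop limitation noted above, one pragmatic guard during generation is to hard-cap the response length and mildly discourage repetition. The following is only a sketch using standard transformers generation options; the specific penalty value is an illustrative assumption, not an official recommendation.

from transformers import GenerationConfig

# Guard against runaway reasoning loops: cap output length and penalize repetition.
# The repetition_penalty value is an illustrative assumption.
gen_config = GenerationConfig(
    max_new_tokens=8192,
    repetition_penalty=1.05,
    do_sample=False,
)
# Used later as: model.generate(**inputs, generation_config=gen_config)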
Guide: Running Locally
- Installation: Install the toolkit for handling visual input:

pip install qwen-vl-utils
- Setup: Load the model and processor (memory-saving options are sketched after this guide):

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
- Inference: Prepare input data and perform inference:

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png"},
            {"type": "text", "text": "What value should be filled in the blank space?"},
        ],
    },
]

# Build the chat prompt and extract image/video inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
- Hardware Recommendation: Use cloud services with GPUs such as the NVIDIA A100 to run the model efficiently, given its high resource demands.
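If memory is tight, two knobs can help: the Qwen2-VL processor accepts a visual-token budget (min_pixels/max_pixels), and the transformers loader accepts a per-device memory cap (max_memory). The sketch below shows both; the concrete pixel and memory figures are illustrative assumptions, not measured requirements.

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Reduce the number of image tokens per input by bounding the pixel budget.
# The values below are example settings, not recommendations.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/QVQ-72B-Preview", min_pixels=min_pixels, max_pixels=max_pixels
)

# Cap how much of each GPU the automatic device map may use when splitting
# the 72B weights across cards (placeholder figures for two 80 GB GPUs).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},
)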
License
The model is released under the Qwen license; further details can be found in the original model repository.