QVQ-72B-Preview 4.65bpw h6 exl2

wolfram

Introduction

QVQ-72B-Preview is an experimental research model by the Qwen team designed to enhance visual reasoning capabilities. It achieves strong results across a range of multimodal benchmarks, highlighting its multidisciplinary understanding and reasoning abilities.

Architecture

This build is a 4.65 bits-per-weight (6-bit head) EXL2 quantization of Qwen/QVQ-72B-Preview, fitting a 32K context with Q4 cache on systems with 48 GB VRAM. The underlying model builds on the Qwen2-VL-72B base model and is supported by the transformers library.
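
Because this build is an EXL2 quantization, it is typically loaded with the ExLlamaV2 runtime rather than transformers. Below is a minimal, text-only loading sketch assuming the ExLlamaV2 Python API; the local model path is a placeholder, and handling of image inputs through ExLlamaV2 is not covered here.

  from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
  from exllamav2.generator import ExLlamaV2DynamicGenerator

  # Placeholder path to the downloaded 4.65bpw h6 EXL2 weights
  config = ExLlamaV2Config("/models/QVQ-72B-Preview-4.65bpw-h6-exl2")
  config.max_seq_len = 32768                    # 32K context, as described above

  model = ExLlamaV2(config)
  cache = ExLlamaV2Cache_Q4(model, lazy=True)   # Q4 KV cache keeps the footprint near 48 GB
  model.load_autosplit(cache, progress=True)    # split layers across available GPUs

  tokenizer = ExLlamaV2Tokenizer(config)
  generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

  # Plain text prompt only; the chat template and vision inputs are omitted in this sketch
  print(generator.generate(prompt="Describe the figure step by step.", max_new_tokens=256))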

Training

The model has been trained to excel at visual reasoning and shows impressive results on benchmarks such as MMMU (Massive Multi-discipline Multimodal Understanding) and MathVision. However, it has known limitations, including language mixing and code-switching, recursive reasoning loops, and losing focus on the image during multi-step visual reasoning. Safety and ethical considerations are also crucial when deploying this model.

Guide: Running Locally

  1. Installation:

    • Install the toolkit for handling visual input:
      pip install qwen-vl-utils
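      
    • The loading code in the next step also needs transformers, PyTorch, and accelerate (for device_map="auto"); an unpinned install line covering all of these (a sketch, exact versions are not specified by the card) is:
      pip install torch transformers accelerate qwen-vl-utils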
      
  2. Setup:

    • Load the model and processor:
      from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
      from qwen_vl_utils import process_vision_info
      
      model = Qwen2VLForConditionalGeneration.from_pretrained(
          "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
      )
      processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
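      
    • Optionally, if flash-attn is installed, the dtype and attention implementation can be set explicitly. This variant is not part of the original snippet; it follows the loading pattern documented for Qwen2-VL:
      import torch
      
      model = Qwen2VLForConditionalGeneration.from_pretrained(
          "Qwen/QVQ-72B-Preview",
          torch_dtype=torch.bfloat16,
          attn_implementation="flash_attention_2",
          device_map="auto",
      )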
      
  3. Inference:

    • Prepare input data and perform inference:
      messages = [
          {"role": "system", "content": [{"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}]},
          {"role": "user", "content": [
              {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png"},
              {"type": "text", "text": "What value should be filled in the blank space?"},
          ]},
      ]
      
      # Build the chat-formatted prompt and collect the image/video inputs
      text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
      image_inputs, video_inputs = process_vision_info(messages)
      inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
      
      # Generate, then strip the prompt tokens from each sequence before decoding
      generated_ids = model.generate(**inputs, max_new_tokens=8192)
      generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
      output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
      print(output_text)
      
  4. Hardware Recommendation:

    • Because of the model's high resource demands, run it on high-memory GPUs such as the NVIDIA A100, locally or via a cloud service.
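
    • Memory use also scales with the number of visual tokens per image; the Qwen2-VL processor accepts a pixel budget that trades image detail for VRAM (a sketch using its documented min_pixels/max_pixels options):
      min_pixels = 256 * 28 * 28
      max_pixels = 1280 * 28 * 28
      processor = AutoProcessor.from_pretrained(
          "Qwen/QVQ-72B-Preview", min_pixels=min_pixels, max_pixels=max_pixels
      )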

License

The model is released under the Qwen license; further details are available in the LICENSE file of the model repository.
