QVQ-72B-Preview

Qwen

Introduction

QVQ-72B-Preview is an experimental research model developed by the Qwen team to enhance visual reasoning capabilities.

Architecture

The model is built on the Qwen2-VL-72B architecture and is used through the Transformers library as an image-text-to-text model. It demonstrates strong performance in multidisciplinary understanding and reasoning.

Training

QVQ-72B-Preview is designed to improve visual reasoning. Images can be supplied as base64-encoded data, URLs, or interleaved with text in the conversation. However, the model currently supports only single-round dialogues and image outputs; video inputs are not supported.

Model Stats

The model has achieved a score of 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark. Its performance on MathVision and OlympiadBench benchmarks shows significant improvements in mathematical reasoning and problem-solving tasks.

Guide: Running Locally

  1. Install Dependencies:

    pip install transformers qwen-vl-utils
    
  2. Use the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
    from qwen_vl_utils import process_vision_info
    
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
    )
    
    processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
    
  3. Process Inputs: Prepare text and images using processor.apply_chat_template and process_vision_info.
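
    A minimal sketch, assuming a single image referenced by URL (the URL and prompt below are placeholders):

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://example.com/sample.png"},  # placeholder URL
                {"type": "text", "text": "What is shown in this image? Think step by step."},
            ],
        },
    ]
    
    # Render the chat template and extract the vision inputs
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)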

  4. Run Inference: Utilize model.generate to obtain outputs and decode them with processor.batch_decode.
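
    Continuing from the inputs prepared above (a sketch; max_new_tokens is an illustrative value):

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the newly generated answer is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text[0])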

  5. Hardware Recommendations: A 72B-parameter model requires substantial GPU memory, so consider cloud GPUs such as AWS EC2 P3 instances or Google Cloud's NVIDIA A100 for acceptable performance.

License

The model is released under the Qwen license. For more details, refer to the license document.
