QVQ-72B-Preview bnb 4bit

unsloth

Introduction

QVQ-72B-Preview is an experimental research model developed by the Qwen team, designed to enhance visual reasoning capabilities. It exhibits strong performance across various benchmarks, demonstrating its multidisciplinary understanding and reasoning abilities.

Architecture

QVQ-72B-Preview is built for image-text-to-text tasks and is implemented with the transformers library. This repository provides a 4-bit quantization of the model via bitsandbytes, which substantially reduces the memory required to load the 72B-parameter weights.
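Since the card does not show how the 4-bit loading is configured, here is a minimal sketch using the transformers BitsAndBytesConfig API. The quantization settings shown (NF4, bfloat16 compute dtype, double quantization) are common defaults assumed for illustration, not values confirmed by this card.

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Typical 4-bit settings (assumed defaults, not confirmed by this card)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized-float-4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for matmuls at runtime
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Passing `quantization_config` at load time means the full-precision weights never need to fit in memory at once; each shard is quantized as it is loaded.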

Training

The model has been trained to excel in multidisciplinary understanding and reasoning. It has shown significant improvements in mathematical reasoning tasks and enhanced abilities in tackling challenging problems. However, it also has limitations such as language mixing, recursive reasoning loops, and performance constraints in basic recognition tasks.

Guide: Running Locally

To run QVQ-72B-Preview locally, follow these steps:

  1. Install the Required Packages:

    pip install transformers accelerate qwen-vl-utils bitsandbytes
    
  2. Load the Model: Use the following Python code to load the model and processor.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info
    
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
    
  3. Prepare Inference Inputs: Configure your inputs for text, images, and videos as shown in the example code.

  4. Run Inference on a GPU: Ensure the inputs are transferred to CUDA for processing:

    inputs = inputs.to("cuda")
    
  5. Generate Outputs: Use the model to generate outputs and decode them for interpretation.

    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    # Strip the prompt tokens so only the newly generated text is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)
    
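Step 3 above refers to example code for preparing inputs; the card omits it, so here is a minimal sketch. The image URL is a placeholder, and `processor` is the object loaded in step 2.

```python
from qwen_vl_utils import process_vision_info

# Placeholder image URL -- substitute your own input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]

# Render the chat template, extract vision inputs, then tokenize everything together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
```

The resulting `inputs` is what steps 4 and 5 move to CUDA and pass to `model.generate`.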

Cloud GPUs: Even in 4-bit precision, a 72B-parameter model requires tens of gigabytes of VRAM, so consider cloud platforms such as AWS, Google Cloud, or Azure for access to GPUs that can handle it.

License

The model is released under the Qwen license. For the full terms, refer to the license link.
