Qwen2-VL-7B-Instruct

Qwen

Introduction

Qwen2-VL is an advanced vision-language model, succeeding Qwen-VL, with cutting-edge capabilities in image and video understanding, multilingual text comprehension, and integration with mobile and robotic devices. It excels in visual understanding benchmarks and supports numerous languages beyond English and Chinese.

Architecture

Qwen2-VL features several architectural enhancements:

  • Naive Dynamic Resolution: Supports arbitrary image resolutions and dynamically maps them into visual tokens for flexible processing.
  • Multimodal Rotary Position Embedding (M-ROPE): Captures 1D textual, 2D visual, and 3D video positional information, boosting multimodal processing capabilities.
  • Model sizes: Available in 2B, 7B, and 72B parameter configurations; this card describes the 7B instruction-tuned variant.
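
To make the dynamic-resolution idea concrete, here is a minimal sketch of how an arbitrary image size could map to a variable number of visual tokens. It assumes the vision encoder's 14x14-pixel patches with a 2x2 patch merge (so one visual token covers roughly a 28x28-pixel area); the simple round-up policy below is an illustration, not the model's exact resizing logic.

```python
import math

PATCH = 14                   # ViT patch edge in pixels
MERGE = 2                    # 2x2 neighbouring patches merge into one token
TOKEN_EDGE = PATCH * MERGE   # so one visual token spans ~28x28 pixels

def visual_token_count(height: int, width: int) -> int:
    """Rough visual-token count for an image, rounding each edge up to a whole token."""
    rows = math.ceil(height / TOKEN_EDGE)
    cols = math.ceil(width / TOKEN_EDGE)
    return rows * cols

print(visual_token_count(224, 224))   # 8 x 8 = 64 tokens
print(visual_token_count(672, 1008))  # 24 x 36 = 864 tokens
```

Because the token count grows with resolution, larger images cost more compute but preserve more detail; the processor exposes knobs to bound this trade-off.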

Performance

Qwen2-VL achieves state-of-the-art results on multiple visual-understanding benchmarks, including MathVista and DocVQA, as well as on video-understanding tasks. Beyond English and Chinese, it understands text in images across many languages, and it can be integrated into mobile and robotic devices for automatic operation driven by visual input and text instructions.

Guide: Running Locally

To run Qwen2-VL locally:

  1. Install Dependencies:

    pip install git+https://github.com/huggingface/transformers
    pip install qwen-vl-utils
    
  2. Load the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    
  3. Prepare Inputs for Inference: Configure input images or videos and use the processor to prepare the data.
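
The processor expects a chat-style list of messages whose content mixes image (or video) entries with text. The sketch below builds that structure by hand; the image URL is a placeholder, and the commented lines show how `processor.apply_chat_template` and `process_vision_info` (from the `qwen-vl-utils` package installed in step 1) would consume it once the model and processor from step 2 are loaded.

```python
# Chat-style input structure for the Qwen2-VL processor.
# The image URL below is a placeholder for illustration.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With the model and processor from step 2 loaded, the flow continues as:
#   from qwen_vl_utils import process_vision_info
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
#   image_inputs, video_inputs = process_vision_info(messages)

print(messages[0]["content"][1]["text"])
```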

  4. Run Inference:

    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens so only the newly generated answer is decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
    

For optimal performance, consider using cloud GPUs like those offered by AWS, Google Cloud, or Azure.

License

Qwen2-VL is released under the Apache License 2.0, allowing for broad usage and distribution.
