Qwen2-VL-7B-Instruct-GPTQ-Int4

Qwen

Introduction

Qwen2-VL-7B-Instruct-GPTQ-Int4 is the GPTQ 4-bit quantized version of Qwen2-VL-7B-Instruct, the latest iteration of the Qwen-VL model family, showcasing advancements in visual understanding and multimodal processing. It offers state-of-the-art capabilities in processing and understanding images, videos, and multilingual text, making it suitable for a wide range of applications, including mobile and robotic integrations.

Architecture

Key architectural updates include:

  • Naive Dynamic Resolution: Handles arbitrary image resolutions by mapping each image into a dynamic number of visual tokens.
  • Multimodal Rotary Position Embedding (M-RoPE): Enhances processing capabilities by capturing positional information across textual, visual, and video dimensions.

The repository includes the instruction-tuned 7B-parameter model, with additional configurations available for different use cases.
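As a rough illustration of how dynamic resolution translates into token counts: Qwen2-VL's documented preprocessing groups each image into 28×28-pixel units (14-pixel ViT patches merged 2×2 into one visual token). The sketch below estimates the token count from an image's dimensions; it is a simplification that omits the preprocessor's rescaling to fit the configurable min/max pixel budget.

```python
import math

def estimate_visual_tokens(height: int, width: int) -> int:
    """Rough estimate: one visual token per 28x28-pixel block (rounded up).

    Simplified sketch -- the real Qwen2-VL preprocessor also rescales
    images to satisfy min_pixels/max_pixels limits, which is omitted here.
    """
    return math.ceil(height / 28) * math.ceil(width / 28)

# A 448x448 image maps to a 16x16 grid of blocks.
print(estimate_visual_tokens(448, 448))  # 256
```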

Benchmarks

The Qwen2-VL series reports the quantized models' performance across benchmarks such as MMMU_VAL, DocVQA_VAL, MMBench_DEV_EN, and MathVista_MINI. Speed benchmarks on GPUs such as the NVIDIA A100 compare inference efficiency across the BF16, GPTQ-Int8, GPTQ-Int4, and AWQ variants.

Guide: Running Locally

  1. Install Dependencies:

    • Ensure the latest version of Hugging Face transformers is installed:
      pip install git+https://github.com/huggingface/transformers
      
    • Install the qwen-vl-utils for handling various visual inputs:
      pip install qwen-vl-utils
      
  2. Set Up the Model:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")
    
  3. Prepare Inputs and Run Inference:

    • Process messages containing text and images/videos using qwen_vl_utils or manually.
    • Send inputs to the model and decode the outputs.
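Putting steps 2 and 3 together, a minimal end-to-end sketch might look as follows. The chat-message format and the `process_vision_info` helper follow the `qwen-vl-utils` convention installed in step 1; the image URL is a placeholder. The model call is wrapped in a function that is not invoked here, since it requires downloading the weights and a suitable GPU — treat this as a sketch, not a tested script.

```python
def build_messages(image_url: str, question: str) -> list:
    """Assemble a single-turn chat message mixing an image and text,
    in the format expected by the Qwen2-VL processor."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

def run_inference(messages: list) -> str:
    """Full pipeline: chat template -> vision preprocessing -> generate.
    Requires the model weights and a GPU, so it is only called explicitly."""
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4")

    # Render the chat template and collect image/video tensors.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens before decoding the answer.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

# Placeholder URL for illustration; swap in a real image before calling.
messages = build_messages("https://example.com/demo.jpg", "Describe this image.")
# answer = run_inference(messages)
```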

For optimal performance, particularly with larger inputs, consider using cloud GPUs like the NVIDIA A100.

License

The model is licensed under the Apache 2.0 License, allowing for broad use and distribution.
