Qwen2-VL-2B-Instruct

Qwen

Introduction

Qwen2-VL is the latest version of the Qwen-VL model, representing nearly a year of advancements. It offers state-of-the-art visual understanding across various benchmarks, comprehension of long videos (20+ minutes), multilingual support, and agent-style integration with devices such as mobile phones and robots for automatic operation based on visual input and text instructions. Enhancements include handling arbitrary image resolutions and improved positional embeddings for multimodal data.

Architecture

Qwen2-VL features Naive Dynamic Resolution, allowing it to manage images of any resolution by mapping them into a dynamic number of visual tokens. It also utilizes Multimodal Rotary Position Embedding (M-ROPE) to effectively process 1D textual, 2D visual, and 3D video positional information. The model is available in configurations with 2, 7, and 72 billion parameters, with the repository containing the instruction-tuned 2B model.
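
Because the visual token count scales with image resolution, the processor exposes a configurable pixel budget per image. A minimal sketch based on the official model card (each visual token covers roughly a 28x28 pixel patch, so these bounds keep token counts between 256 and 1280 per image):

    from transformers import AutoProcessor

    # Bound the per-image resolution range to trade accuracy against token cost
    min_pixels = 256 * 28 * 28
    max_pixels = 1280 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
    )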

Evaluation

The instruction-tuned model is evaluated against several image and video benchmarks, where it demonstrates strong performance in both domains. Running it requires the latest Hugging Face transformers library (installed from source, as shown below) and the qwen-vl-utils package, which handles the various supported visual input formats.

Guide: Running Locally

Basic Steps

  1. Install Required Libraries:

    pip install git+https://github.com/huggingface/transformers
    pip install qwen-vl-utils
    
  2. Load the Model and Processor:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
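
    If your GPU supports it, the official model card also shows enabling FlashAttention-2 for lower memory use and better speed (requires the flash-attn package to be installed):

    import torch

    # Optional: load in bfloat16 with FlashAttention-2 on supported GPUs
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-2B-Instruct",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )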
    
  3. Prepare Inputs for Inference:

    • Build a chat-style messages list and use qwen_vl_utils to extract the image and video inputs.
    • Render the chat template with the processor and batch everything into tensors, as shown in the sketch below.
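
    A minimal sketch of this step, adapted from the official model card (the image path is a placeholder; http(s) URLs and base64 data are also accepted by qwen-vl-utils):

    from qwen_vl_utils import process_vision_info

    # Placeholder image; replace with a real local path, URL, or base64 string
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/image.jpg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # Render the chat template, extract vision inputs, and batch into tensors
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
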
  4. Run Inference:

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Trim the echoed prompt tokens so only the newly generated reply is decoded
    trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
    print(output_text)
    

Cloud GPUs

For optimal performance, especially in scenarios involving multiple images or videos, it is recommended to use cloud GPUs such as AWS EC2 instances with GPU support, Google Cloud GPU offerings, or Azure GPU virtual machines.
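
The message format for video mirrors the image case; a minimal sketch based on the official model card (the video path is a placeholder, and the per-input max_pixels and fps fields bound the visual token cost):

    # Placeholder local video; qwen-vl-utils samples frames before preprocessing
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "file:///path/to/video.mp4",
                    "max_pixels": 360 * 420,
                    "fps": 1.0,
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]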

License

Qwen2-VL-2B-Instruct is released under the Apache-2.0 license, allowing wide use and modification with proper attribution.
