Qwen2-VL-72B-Instruct
Introduction
Qwen2-VL is an advanced version of the Qwen-VL model, showcasing significant enhancements in multimodal capabilities such as image, video, and multilingual understanding. It supports complex reasoning and decision-making processes, making it suitable for integration with devices like mobile phones and robots.
Architecture
The model incorporates several architectural improvements:
- Naive Dynamic Resolution: Handles images of arbitrary resolution by mapping them into a dynamic number of visual tokens (see the processor sketch after this list).
- Multimodal Rotary Position Embedding (M-ROPE): Captures positional information across 1D textual, 2D visual, and 3D video inputs, enhancing multimodal processing.
- The repository contains a 72 billion parameter instruction-tuned model, with additional models available at different parameter scales.
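As a concrete illustration of dynamic resolution, the Hugging Face processor for this model accepts min_pixels and max_pixels bounds that cap how many visual tokens each image is mapped to. The sketch below uses illustrative bounds, not required values; each visual token corresponds to roughly a 28x28 pixel patch.

```python
from transformers import AutoProcessor

# Each visual token covers roughly a 28x28 pixel patch, so these bounds
# translate into a per-image budget of about 256-1280 visual tokens.
# The exact values are illustrative; tune them to your memory/quality trade-off.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```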
Performance
Qwen2-VL achieves state-of-the-art (SoTA) performance on benchmarks such as MathVista and DocVQA, and supports multiple languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese. It can interpret videos over 20 minutes long, enabling video-based question answering and content creation.
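For video-based question answering, video inputs use the same chat-message format as images (see the inference sketch further below). A minimal sketch of such a message, assuming a placeholder local video path and illustrative frame-sampling options consumed by the qwen-vl-utils helper:

```python
# A chat-style message describing a video QA request. The video path and the
# fps/max_pixels values are placeholders for illustration; qwen-vl-utils
# samples frames from the file according to these options.
video_message = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",  # placeholder path
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]
```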
Guide: Running Locally
Basic Steps
- Installation: Install the required libraries:

```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```
- Model Loading:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")
```
- Inference: Prepare chat-style messages containing images or videos, process them with the processor, and generate with the model; see the sketch after this list.
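A minimal end-to-end sketch for the single-image case, assuming the model and processor loaded above, a CUDA device, and a placeholder image path (the qwen-vl-utils helper also accepts URLs and base64-encoded images):

```python
from qwen_vl_utils import process_vision_info

# Chat-style message combining one image with a text prompt.
# The image path is a placeholder; replace it with your own file or URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and extract the vision inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```

The same flow handles multi-image and video conversations; only the content entries in messages change.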
Cloud GPUs
For optimal performance, especially in video and multi-image scenarios, consider running on cloud GPU instances from providers such as AWS, Google Cloud, or Azure.
License
The model is released under the "tongyi-qianwen" license. For more details, refer to the license included in the model repository.