Qwen-VL
Introduction
Qwen-VL is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. Qwen-VL supports multilingual dialogue, multi-image interleaved dialogue, open-domain grounding in Chinese, and fine-grained image recognition.
Architecture
The Qwen-VL series comprises two models: Qwen-VL, the pre-trained base model, and Qwen-VL-Chat, a chat-aligned variant tuned for dialogue. Both take image and text inputs and generate text and bounding-box outputs, enabling tasks such as multilingual dialogue and fine-grained visual grounding.
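Because grounded responses embed coordinates inline in the generated text, a small post-processing step is usually needed to recover usable boxes. The sketch below assumes the tag format and 0-1000 normalized coordinate grid described in the public Qwen-VL model card; treat both as assumptions if your model revision differs.

```python
import re

# Grounded outputs look like: "<ref>the dog</ref><box>(221,423),(569,886)</box>"
# Coordinates are assumed to be normalized to a 0-1000 grid (per the model card).
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_boxes(text: str, img_w: int, img_h: int):
    """Extract (label, pixel-space box) pairs from a grounded response."""
    boxes = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        # Rescale the normalized grid onto the actual image dimensions.
        boxes.append((m.group("label"),
                      (x1 * img_w / 1000, y1 * img_h / 1000,
                       x2 * img_w / 1000, y2 * img_h / 1000)))
    return boxes

print(parse_boxes("<ref>the dog</ref><box>(221,423),(569,886)</box>", 1280, 720))
```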
Training
Training draws on diverse, large-scale image-text datasets to strengthen performance on tasks such as zero-shot image captioning and visual question answering (VQA). Evaluation covers standard benchmarks as well as TouchStone, a benchmark spanning a broad range of multimodal tasks and human-aligned dialogue capabilities.
Guide: Running Locally
Requirements
- Python 3.8 or above
- PyTorch 1.12 or above (2.0 recommended)
- CUDA 11.4 or above for GPU users (a quick environment check is sketched below)
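A short snippet can verify these requirements before installing anything else (a minimal sketch; the thresholds mirror the list above):

```python
import sys
import torch

# Confirm the interpreter and PyTorch versions listed in the requirements.
assert sys.version_info >= (3, 8), "Python 3.8 or above is required"
print("PyTorch:", torch.__version__)

if torch.cuda.is_available():
    # torch.version.cuda reports the CUDA toolkit PyTorch was built against.
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected; inference will fall back to CPU and be slow.")
```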
Setup
- Ensure all dependencies are installed:
pip install -r requirements.txt
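The requirements.txt file ships with the official GitHub repository. If you are not working from a repository clone, the commonly required packages can be installed directly; note that this package list is an assumption based on typical Qwen-VL setups, not the authoritative file contents:

pip install transformers accelerate tiktoken einops transformers_stream_generator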
Usage
- Load the model using the transformers library. Example code snippet for model instantiation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code is required: Qwen-VL ships custom modeling code with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True
).eval()
```
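Once the model is loaded, inference follows the pattern from the published model card: the remote-code tokenizer exposes a from_list_format helper that interleaves images and text into a single query string. A minimal sketch follows; the image URL is a placeholder, and the helper name is taken from the public Qwen-VL card, so verify it against your model revision.

```python
# Build a multimodal query; the image entry may be a local path or a URL.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},  # placeholder image
    {"text": "Generate the caption in English with grounding:"},
])
inputs = tokenizer(query, return_tensors="pt").to(model.device)
pred = model.generate(**inputs)
# Keep special tokens so grounding tags such as <ref>/<box> survive decoding.
print(tokenizer.decode(pred[0], skip_special_tokens=False))
```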
Cloud GPUs
- For optimal performance, consider using cloud-based GPU resources such as AWS EC2, Google Cloud, or Azure.
License
Qwen-VL and Qwen-VL-Chat are available for research and development use, and commercial use is permitted. Detailed license information can be found in the LICENSE file; for commercial usage, complete the application form provided by Alibaba Cloud.