Qwen-VL
Introduction
Qwen-VL is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. Qwen-VL supports multilingual dialogue, multi-image interleaved dialogue, open-domain grounding in Chinese, and fine-grained image recognition.
Architecture
The Qwen-VL series comprises two models: Qwen-VL, the pre-trained base model, and Qwen-VL-Chat, a chat-aligned variant tuned for dialogue. Both take image and text inputs and generate text and bounding-box outputs, enabling tasks such as multilingual dialogue and fine-grained visual grounding.
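Because grounded responses embed coordinates inline in the generated text, a small post-processing step is usually needed to recover usable boxes. The sketch below assumes the tag format and 0-1000 normalized coordinate grid described in the public Qwen-VL model card; treat both as assumptions if your model revision differs.

```python
import re

# Grounded outputs look like: "<ref>the dog</ref><box>(221,423),(569,886)</box>"
# Coordinates are assumed to be normalized to a 0-1000 grid (per the model card).
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_boxes(text: str, img_w: int, img_h: int):
    """Extract (label, pixel-space box) pairs from a grounded response."""
    boxes = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        # Rescale the normalized grid onto the actual image dimensions.
        boxes.append((m.group("label"),
                      (x1 * img_w / 1000, y1 * img_h / 1000,
                       x2 * img_w / 1000, y2 * img_h / 1000)))
    return boxes

print(parse_boxes("<ref>the dog</ref><box>(221,423),(569,886)</box>", 1280, 720))
```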
Training
Training draws on diverse, large-scale image-text datasets to strengthen performance on tasks such as zero-shot image captioning and visual question answering (VQA). Evaluation covers standard benchmarks as well as TouchStone, a benchmark spanning a broad range of multimodal tasks and human-aligned dialogue capabilities.
Guide: Running Locally
Requirements
- Python 3.8 or above
- PyTorch 1.12 or above (2.0 recommended)
- CUDA 11.4 or above for GPU users (a quick environment check is sketched below)
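A short snippet can verify these requirements before installing anything else (a minimal sketch; the thresholds mirror the list above):

```python
import sys
import torch

# Confirm the interpreter and PyTorch versions listed in the requirements.
assert sys.version_info >= (3, 8), "Python 3.8 or above is required"
print("PyTorch:", torch.__version__)

if torch.cuda.is_available():
    # torch.version.cuda reports the CUDA toolkit PyTorch was built against.
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected; inference will fall back to CPU and be slow.")
```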
Setup
- Ensure all dependencies are installed:
pip install -r requirements.txt
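The requirements.txt file ships with the official GitHub repository. If you are not working from a repository clone, the commonly required packages can be installed directly; note that this package list is an assumption based on typical Qwen-VL setups, not the authoritative file contents:

pip install transformers accelerate tiktoken einops transformers_stream_generator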
Usage
- Load the model using the transformers library. Example code snippet for model instantiation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code is required: Qwen-VL ships custom modeling code with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True
).eval()
```
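Once the model is loaded, inference follows the pattern from the published model card: the remote-code tokenizer exposes a from_list_format helper that interleaves images and text into a single query string. A minimal sketch follows; the image URL is a placeholder, and the helper name is taken from the public Qwen-VL card, so verify it against your model revision.

```python
# Build a multimodal query; the image entry may be a local path or a URL.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},  # placeholder image
    {"text": "Generate the caption in English with grounding:"},
])
inputs = tokenizer(query, return_tensors="pt").to(model.device)
pred = model.generate(**inputs)
# Keep special tokens so grounding tags such as <ref>/<box> survive decoding.
print(tokenizer.decode(pred[0], skip_special_tokens=False))
```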
Cloud GPUs
- For optimal performance, consider using cloud-based GPU resources such as AWS EC2, Google Cloud, or Azure.
License
Qwen-VL and Qwen-VL-Chat are available for research and development use, and commercial use is permitted. Detailed license information can be found in the LICENSE file; for commercial usage, complete the application form provided by Alibaba Cloud.