InternVL2_5-8B-AWQ
OpenGVLab
Introduction
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that extends the capabilities of InternVL 2.0. It introduces enhancements in training and testing strategies and in data quality while maintaining the core architecture of its predecessor.
Architecture
InternVL 2.5 retains the "ViT-MLP-LLM" architecture, integrating a newly pre-trained InternViT with pre-trained LLMs such as InternLM 2.5 and Qwen 2.5 through a randomly initialized MLP projector. Key features include a pixel unshuffle operation that reduces the number of visual tokens and a dynamic resolution strategy that enables multi-image and video support.
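To make the token-reduction step concrete, here is a minimal sketch of a pixel unshuffle in PyTorch. This is an illustration, not the model's actual code; the 0.5 downsample factor and the tensor shapes are assumptions based on common InternVL configurations:

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Fold spatial blocks of ViT features into the channel dimension,
    shrinking the visual token count by 1/scale**2 (4x at scale=0.5)."""
    n, h, w, c = x.shape
    x = x.view(n, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(w * scale), int(h * scale), int(c / (scale * scale)))
    return x.permute(0, 2, 1, 3).contiguous()

feats = torch.randn(1, 32, 32, 1024)  # e.g. 32x32 = 1024 ViT patch features
out = pixel_unshuffle(feats)          # (1, 16, 16, 4096): 256 visual tokens
```

The spatial grid shrinks while the channel width grows, so the LLM sees a quarter of the visual tokens at the same total feature size.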
Training
The training process for InternVL 2.5 involves incremental pre-training of the vision and language components, leveraging high-quality data to enhance the model's performance across various tasks. The integration of multiple pre-trained LLMs provides flexibility and improved conversational capabilities.
Guide: Running Locally
- Installation: Install the LMDeploy toolkit using pip (quote the version specifier so the shell does not interpret `>` as a redirect):

  ```bash
  pip install "lmdeploy>=0.6.4"
  ```
- Basic Usage Example: Load an image and use the pipeline to generate a description:

  ```python
  from lmdeploy import pipeline
  from lmdeploy.vl import load_image  # load_image lives in lmdeploy.vl

  model = 'OpenGVLab/InternVL2_5-8B-AWQ'
  image = load_image('https://example.com/image.jpg')
  pipe = pipeline(model)
  response = pipe(('describe this image', image))
  print(response.text)
  ```
- Multi-Image and Batch Prompts:
  - For multiple images, load them into a list and increase the context window size; see the sketch after this list.
  - Batch prompts can be handled by placing them in a list.
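A minimal sketch of both patterns, assuming LMDeploy's `TurbomindEngineConfig` and `IMAGE_TOKEN` helpers; the image URLs are placeholders, and `session_len` is raised so the extra image tokens fit in one context:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-8B-AWQ'
# A larger session_len enlarges the context window for multi-image inputs.
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384))

urls = ['https://example.com/a.jpg', 'https://example.com/b.jpg']  # placeholders
images = [load_image(u) for u in urls]

# Multi-image prompt: reference each image explicitly via IMAGE_TOKEN.
prompt = f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\nDescribe these two images.'
response = pipe((prompt, images))
print(response.text)

# Batch prompts: a list of (text, image) pairs is processed in one call.
responses = pipe([('describe this image', img) for img in images])
print([r.text for r in responses])
```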
- Cloud GPUs: For optimal performance, especially with larger models, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure for faster processing and scalability.
License
This project is released under the MIT License. It incorporates components like the pre-trained internlm2_5-7b-chat, which is licensed under the Apache License 2.0.