Intern V L2 26 B
OpenGVLabIntroduction
InternVL 2.0 is the latest generation in the InternVL series of multimodal large language models, comprising instruction-tuned models that range from 1 billion to 108 billion parameters. This repository hosts the InternVL2-26B model. InternVL 2.0 outperforms many open-source models and is competitive with commercial models on tasks such as document comprehension, scene text understanding, and integrated multimodal reasoning. It features an 8k context window and is trained on long texts, multiple images, and videos, improving markedly on its predecessor, InternVL 1.5.
Architecture
InternVL2-26B combines the InternViT-6B-448px-V1-5 vision encoder with the internlm2-chat-20b language model. The vision encoder extracts visual features that are projected into the language model's embedding space, coupling vision and language processing in a single multimodal pipeline.
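The split between the two components is visible in the checkpoint's configuration. Below is a minimal inspection sketch using `transformers`; the `vision_config` and `llm_config` attribute names are assumptions based on the repository's remote configuration code and are worth verifying against the checkpoint:

```python
from transformers import AutoConfig

# trust_remote_code is required because InternVL2 ships its own config class.
cfg = AutoConfig.from_pretrained("OpenGVLab/InternVL2-26B", trust_remote_code=True)

# Attribute names below are assumptions (InternVLChatConfig in the repo's
# remote code); check the checkpoint's config.json to confirm.
print(cfg.vision_config.model_type)  # vision encoder, e.g. InternViT
print(cfg.llm_config.model_type)     # language model, e.g. internlm2
```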
Training
InternVL 2.0 is trained on extensive multimodal data and can process long text alongside multiple images and videos. The series includes instruction-tuned models optimized for a range of multimodal tasks, with training focused on document comprehension, scene text understanding, and cultural understanding.
Guide: Running Locally
To run InternVL2-26B locally:
- Environment Setup: Ensure you have `transformers` version 4.37.2 or higher.
- Model Loading:
  - Use FP16 or BF16 precision for loading the model.
  - Multi-GPU setups are supported for handling larger models.
- Inference:
  - Import the necessary libraries, such as `torch` and `transformers`.
  - Load the model with `AutoModel.from_pretrained` and set it to evaluation mode.
  - Preprocess input images and videos as necessary.
- Run Inference: Use the provided scripts to execute inference tasks (a minimal sketch follows this list).
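The steps above condense into a short script. What follows is a minimal sketch rather than the official example: it assumes the standard `transformers` API plus the `chat` method provided by the model's remote code, uses a placeholder image path, and replaces the official dynamic-tiling preprocessing with a single 448x448 tile:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"

# Load in BF16; trust_remote_code is required because the model ships custom code.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing: resize to the 448x448 input resolution of
# InternViT-6B-448px-V1-5 and normalize with ImageNet statistics. The official
# example additionally tiles large images dynamically before stacking the tiles.
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# "example.jpg" is a placeholder path.
pixel_values = transform(Image.open("example.jpg")).unsqueeze(0)  # (1, 3, 448, 448)
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# `chat` comes from the model's remote code; `<image>` marks where the image
# tokens are spliced into the prompt.
question = "<image>\nPlease describe the image briefly."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

For multi-GPU loading, one option is to pass `device_map="auto"` to `from_pretrained` (with `accelerate` installed); the official repository also shows how to build a custom device map that keeps the vision encoder on a single GPU.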
For better performance, consider using cloud GPUs such as those from AWS or Google Cloud.
License
This project is licensed under the MIT License. It incorporates the pre-trained internlm2-chat-20b, which is under the Apache License 2.0.