InternVL2_5-8B (OpenGVLab)
Introduction
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that enhances its predecessors through improved training and testing strategies and higher-quality data. It builds on the architecture of InternVL 2.0 and incorporates significant advancements in handling multimodal data, including images and videos.
Architecture
InternVL 2.5 maintains the "ViT-MLP-LLM" architecture, integrating a pre-trained InternViT with LLMs like InternLM 2.5 and Qwen 2.5. It employs a pixel unshuffle operation to reduce visual tokens, and uses dynamic resolution strategies for image processing. The architecture supports multi-image and video data, enhancing the model's flexibility and capability in handling various data types.
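To make the pixel unshuffle concrete, below is a minimal sketch of such an operation on a grid of ViT patch features: each 2x2 group of spatial tokens is folded into the channel dimension, cutting the visual token count to a quarter. The function name and shapes are illustrative, not the model's actual code.

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    # x: (batch, height, width, channels) grid of ViT patch features.
    # Folds each 2x2 group of tokens into the channel dimension, reducing
    # the number of visual tokens to a quarter while quadrupling channels.
    b, h, w, c = x.shape
    x = x.reshape(b, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.reshape(b, int(w * scale), int(h * scale), int(c / (scale * scale)))
    x = x.permute(0, 2, 1, 3).contiguous()
    return x

# Example: 1024 patch tokens (32x32 grid) -> 256 tokens with 4x channel width.
feats = torch.randn(1, 32, 32, 1024)
print(pixel_unshuffle(feats).shape)  # torch.Size([1, 16, 16, 4096])
```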
Training
InternVL 2.5 employs a dynamic high-resolution training strategy to accommodate multi-image and video datasets. The training pipeline is divided into three stages (a stage-wise freezing sketch follows the list):
- Stage 1: MLP warmup with frozen vision and language models.
- Stage 1.5: Optional incremental learning for the vision encoder.
- Stage 2: Full model instruction tuning on high-quality datasets.
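The stages differ mainly in which modules receive gradients. Here is a minimal sketch of that freezing schedule; the attribute names (`vision_model`, `mlp1`, `language_model`) are assumptions for illustration, not the official training code:

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: str) -> None:
    # Freeze everything first, then unfreeze the modules trained in each stage.
    for p in model.parameters():
        p.requires_grad = False
    if stage == "stage1":        # MLP warmup: projector only
        trainable = [model.mlp1]                     # attribute name assumed
    elif stage == "stage1_5":    # optional vision-encoder incremental learning
        trainable = [model.vision_model, model.mlp1]
    elif stage == "stage2":      # full-model instruction tuning
        trainable = [model.vision_model, model.mlp1, model.language_model]
    else:
        raise ValueError(f"unknown stage: {stage}")
    for m in trainable:
        for p in m.parameters():
            p.requires_grad = True
```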
Progressive scaling is used to efficiently align the vision encoder with LLMs, minimizing redundancy and maximizing component reuse. Training enhancements include random JPEG compression and loss reweighting to improve model robustness and balance.
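Of these enhancements, random JPEG compression is simple to reproduce as a data augmentation. A minimal sketch using Pillow; the quality range here is illustrative, not the paper's exact setting:

```python
import io
import random
from PIL import Image

def random_jpeg_compression(img: Image.Image, quality=(30, 95)) -> Image.Image:
    # Re-encode the image at a random JPEG quality so the model becomes
    # robust to the compression artifacts common in web imagery.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(*quality))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```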
Guide: Running Locally
Basic Steps
- Install Required Libraries: Ensure you have `transformers>=4.37.2` and other necessary libraries installed.
- Model Loading: Use `AutoTokenizer` and `AutoModel` from the transformers library to load the model.

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2_5-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```
- Inference: Prepare your data and use the model for inference, handling single or multiple GPUs as needed (see the sketch below).
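Continuing from the loading snippet above, the example below runs a text-only turn through the model's `chat()` helper exposed via `trust_remote_code`; for image inputs you would pass a preprocessed `pixel_values` tensor instead of `None`, following the preprocessing code in the model repository.

```python
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=True)

# Text-only conversation; pass pixel_values=None when no image is attached.
question = "Hello, who are you?"
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(response)
```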
Cloud GPUs
Cloud GPUs, such as NVIDIA GPU instances on AWS EC2 or Google Cloud's AI Platform, can significantly speed up inference, especially for large models like InternVL 2.5.
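When a single GPU cannot hold the model, one option is to let the `accelerate` package shard it with an automatic device map, as sketched below. Whether an automatic map splits this particular architecture cleanly can vary, so treat this as a starting point rather than the repository's recommended method.

```python
# Requires the `accelerate` package for device_map="auto".
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map="auto").eval()  # no .cuda(): weights are already placed on devices
```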
License
InternVL 2.5 is released under the MIT License. It incorporates components such as the pre-trained `internlm2_5-7b-chat`, which is licensed under the Apache License 2.0.