InternVL2_5-38B
OpenGVLab
Introduction
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, introducing significant enhancements in training, testing strategies, and data quality while maintaining its core model architecture.
Architecture
InternVL 2.5 retains the same "ViT-MLP-LLM" architecture as its predecessors, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 2.5 and Qwen 2.5. The model features a pixel unshuffle operation, reducing visual tokens to one-quarter of the original, and supports multi-image and video data. Images are divided into tiles of 448×448 pixels.
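The token reduction works like a standard pixel unshuffle: neighboring visual tokens are folded into the channel dimension, so each 448×448 tile yields 256 tokens instead of 1024. Below is a minimal sketch of that step, assuming a ViT feature map of shape (batch, 32, 32, channels); the function name and shapes are illustrative, not the official implementation.

```python
# Sketch of the pixel unshuffle step (illustrative, not the official API).
import torch

def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold a (B, H, W, C) feature map into (B, H/scale, W/scale, C*scale^2),
    cutting the number of visual tokens by scale^2 (4x for scale=2)."""
    b, h, w, c = x.shape
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, h // scale, w // scale, c * scale * scale)

vit_tokens = torch.randn(1, 32, 32, 1024)   # 1024 patch tokens per 448x448 tile
merged = pixel_unshuffle(vit_tokens)        # -> (1, 16, 16, 4096): 256 tokens
print(merged.shape)
```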
Training
Dynamic High-Resolution for Multimodal Data
InternVL 2.5 extends the dynamic high-resolution training approach to multi-image and video datasets. For single-image samples, the full tile budget goes to one image; for multi-image samples, the budget is distributed across all images in the sample. Videos are processed frame by frame.
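The sketch below illustrates one way such a tile budget could be split; the proportional split and function names are assumptions for illustration, not the exact InternVL 2.5 logic.

```python
# Hedged sketch of distributing a tile budget across the images in one sample.
def allocate_tiles(num_images: int, max_tiles: int = 12) -> list[int]:
    """A single image may use up to max_tiles 448x448 tiles; with N images
    the budget is shared, and every image keeps at least one tile."""
    per_image = max(1, max_tiles // num_images)
    return [per_image] * num_images

def allocate_video(num_frames: int) -> list[int]:
    """Video frames are processed frame by frame, one 448x448 tile each."""
    return [1] * num_frames

print(allocate_tiles(1))   # [12] for a single high-resolution image
print(allocate_tiles(3))   # [4, 4, 4] for a three-image sample
print(allocate_video(8))   # [1, 1, 1, 1, 1, 1, 1, 1]
```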
Single Model Training Pipeline
The training pipeline comprises three stages:
- Stage 1: MLP Warmup - Only the MLP projector is trained using dynamic high-resolution training.
- Stage 1.5: ViT Incremental Learning (Optional) - Incremental training of the vision encoder and MLP projector to handle rare domains.
- Stage 2: Full Model Instruction Tuning - The entire model is trained on high-quality multimodal instruction datasets.
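The following sketch shows which parts of the ViT-MLP-LLM stack would be trainable in each stage; the `model` wrapper and attribute names are hypothetical, used only to make the staging concrete.

```python
# Illustrative stage configuration (not official training code). The wrapper
# with .vision_encoder / .mlp_projector / .llm attributes is hypothetical.
def set_trainable(module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: str) -> None:
    if stage == "mlp_warmup":            # Stage 1: train only the MLP projector
        set_trainable(model.vision_encoder, False)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, False)
    elif stage == "vit_incremental":     # Stage 1.5: tune ViT + MLP for rare domains
        set_trainable(model.vision_encoder, True)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, False)
    elif stage == "instruction_tuning":  # Stage 2: full model on instruction data
        set_trainable(model.vision_encoder, True)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, True)
```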
Progressive Scaling Strategy
The model employs a progressive scaling strategy to align the vision encoder efficiently with LLMs, starting with smaller LLMs and transferring to larger models without retraining.
Training Enhancements
- Random JPEG Compression simulates image degradation to improve robustness.
- Loss Reweighting balances contributions from responses of varying lengths.
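Two small, hedged sketches of these enhancements follow; the JPEG quality range and the square-root reweighting are illustrative assumptions rather than the exact settings used in InternVL 2.5.

```python
# Illustrative implementations of the two training enhancements above.
import io
import random
import torch
from PIL import Image

def random_jpeg_compression(img: Image.Image, q_min: int = 75, q_max: int = 100) -> Image.Image:
    """Re-encode the image at a random JPEG quality to simulate degradation."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(q_min, q_max))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def reweighted_loss(token_losses: torch.Tensor) -> torch.Tensor:
    """Scale a response's summed token loss by 1/sqrt(length), so long answers
    do not dominate the batch and short ones are not drowned out (assumed scheme)."""
    n = token_losses.numel()
    return token_losses.sum() / (n ** 0.5)
```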
Data Organization
Key parameters like data augmentation, maximum tile number, and repeat factor control the organization and balance of datasets. A data filtering pipeline removes low-quality samples to ensure high-quality data for training.
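As a rough illustration of how these parameters might be expressed per dataset, here is a hypothetical configuration; the field names mirror the parameters described above but are not the official config schema.

```python
# Hypothetical per-dataset sampling controls (not the official schema).
datasets = [
    {"name": "doc_qa",     "max_tiles": 12, "repeat_factor": 2.0, "augment": True},
    {"name": "video_chat", "max_tiles": 1,  "repeat_factor": 0.5, "augment": False},
]

def effective_size(ds: dict, base_size: int) -> int:
    """repeat_factor > 1 oversamples a small dataset; < 1 downsamples a large one."""
    return int(base_size * ds["repeat_factor"])

print(effective_size(datasets[0], 10_000))  # 20000 samples seen per epoch
print(effective_size(datasets[1], 10_000))  # 5000 samples seen per epoch
```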
Guide: Running Locally
Basic Steps
- Model Loading: Load the model using the `transformers` library with the specified configurations (see the sketch after this list).
- Inference: Utilize the provided code examples for single-image, multi-image, and video inference.
- Fine-Tuning and Deployment: Refer to the documentation for fine-tuning options and deployment using LMDeploy.
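The snippet below is a minimal loading and inference sketch following the usual `transformers` pattern for this model family; the `load_image` preprocessing helper and the exact `chat()` signature come from the model repository's remote code, so treat this as a starting point rather than a complete example.

```python
# Minimal loading sketch for OpenGVLab/InternVL2_5-38B via transformers.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,   # pulls the InternVL modeling/chat code from the repo
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Single-image inference (assumed usage): images are tiled into 448x448 crops.
# `load_image` is the preprocessing helper shipped with the model card's examples.
# pixel_values = load_image("./example.jpg", max_num=12).to(torch.bfloat16).cuda()
# question = "<image>\nPlease describe the image shortly."
# response = model.chat(tokenizer, pixel_values, question,
#                       generation_config=dict(max_new_tokens=1024))
# print(response)
```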
Cloud GPUs
Running the model efficiently requires cloud GPUs. Consider using platforms like AWS, Google Cloud, or Azure with high-memory GPUs such as NVIDIA A100.
License
This project is released under the MIT License. It includes components like Qwen2.5-32B-Instruct, licensed under the Apache License 2.0.