InternVL2_5-38B
OpenGVLab
Introduction
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, introducing significant enhancements in training, testing strategies, and data quality while maintaining its core model architecture.
Architecture
InternVL 2.5 retains the same "ViT-MLP-LLM" architecture as its predecessors, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 2.5 and Qwen 2.5. The model features a pixel unshuffle operation, reducing visual tokens to one-quarter of the original, and supports multi-image and video data. Images are divided into tiles of 448×448 pixels.
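The token reduction works like a standard pixel unshuffle: neighboring visual tokens are folded into the channel dimension, so each 448×448 tile yields 256 tokens instead of 1024. Below is a minimal sketch of that step, assuming a ViT feature map of shape (batch, 32, 32, channels); the function name and shapes are illustrative, not the official implementation.

```python
# Sketch of the pixel unshuffle step (illustrative, not the official API).
import torch

def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold a (B, H, W, C) feature map into (B, H/scale, W/scale, C*scale^2),
    cutting the number of visual tokens by scale^2 (4x for scale=2)."""
    b, h, w, c = x.shape
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, h // scale, w // scale, c * scale * scale)

vit_tokens = torch.randn(1, 32, 32, 1024)   # 1024 patch tokens per 448x448 tile
merged = pixel_unshuffle(vit_tokens)        # -> (1, 16, 16, 4096): 256 tokens
print(merged.shape)
```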
Training
Dynamic High-Resolution for Multimodal Data
InternVL 2.5 extends the dynamic high-resolution training approach to multi-image and video datasets. For single-image samples, the full tile budget goes to one image; for multi-image samples, the budget is distributed across all images in the sample. Videos are processed frame by frame.
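The sketch below illustrates one way such a tile budget could be split; the proportional split and function names are assumptions for illustration, not the exact InternVL 2.5 logic.

```python
# Hedged sketch of distributing a tile budget across the images in one sample.
def allocate_tiles(num_images: int, max_tiles: int = 12) -> list[int]:
    """A single image may use up to max_tiles 448x448 tiles; with N images
    the budget is shared, and every image keeps at least one tile."""
    per_image = max(1, max_tiles // num_images)
    return [per_image] * num_images

def allocate_video(num_frames: int) -> list[int]:
    """Video frames are processed frame by frame, one 448x448 tile each."""
    return [1] * num_frames

print(allocate_tiles(1))   # [12] for a single high-resolution image
print(allocate_tiles(3))   # [4, 4, 4] for a three-image sample
print(allocate_video(8))   # [1, 1, 1, 1, 1, 1, 1, 1]
```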
Single Model Training Pipeline
The training pipeline comprises three stages:
- Stage 1: MLP Warmup - Only the MLP projector is trained using dynamic high-resolution training.
- Stage 1.5: ViT Incremental Learning (Optional) - Incremental training of the vision encoder and MLP projector to handle rare domains.
- Stage 2: Full Model Instruction Tuning - The entire model is trained on high-quality multimodal instruction datasets.
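The following sketch shows which parts of the ViT-MLP-LLM stack would be trainable in each stage; the `model` wrapper and attribute names are hypothetical, used only to make the staging concrete.

```python
# Illustrative stage configuration (not official training code). The wrapper
# with .vision_encoder / .mlp_projector / .llm attributes is hypothetical.
def set_trainable(module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: str) -> None:
    if stage == "mlp_warmup":            # Stage 1: train only the MLP projector
        set_trainable(model.vision_encoder, False)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, False)
    elif stage == "vit_incremental":     # Stage 1.5: tune ViT + MLP for rare domains
        set_trainable(model.vision_encoder, True)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, False)
    elif stage == "instruction_tuning":  # Stage 2: full model on instruction data
        set_trainable(model.vision_encoder, True)
        set_trainable(model.mlp_projector, True)
        set_trainable(model.llm, True)
```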
Progressive Scaling Strategy
The model employs a progressive scaling strategy to align the vision encoder efficiently with LLMs, starting with smaller LLMs and transferring to larger models without retraining.
Training Enhancements
- Random JPEG Compression simulates image degradation to improve robustness.
- Loss Reweighting balances contributions from responses of varying lengths.
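Two small, hedged sketches of these enhancements follow; the JPEG quality range and the square-root reweighting are illustrative assumptions rather than the exact settings used in InternVL 2.5.

```python
# Illustrative implementations of the two training enhancements above.
import io
import random
import torch
from PIL import Image

def random_jpeg_compression(img: Image.Image, q_min: int = 75, q_max: int = 100) -> Image.Image:
    """Re-encode the image at a random JPEG quality to simulate degradation."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(q_min, q_max))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def reweighted_loss(token_losses: torch.Tensor) -> torch.Tensor:
    """Scale a response's summed token loss by 1/sqrt(length), so long answers
    do not dominate the batch and short ones are not drowned out (assumed scheme)."""
    n = token_losses.numel()
    return token_losses.sum() / (n ** 0.5)
```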
Data Organization
Key parameters like data augmentation, maximum tile number, and repeat factor control the organization and balance of datasets. A data filtering pipeline removes low-quality samples to ensure high-quality data for training.
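As a rough illustration of how these parameters might be expressed per dataset, here is a hypothetical configuration; the field names mirror the parameters described above but are not the official config schema.

```python
# Hypothetical per-dataset sampling controls (not the official schema).
datasets = [
    {"name": "doc_qa",     "max_tiles": 12, "repeat_factor": 2.0, "augment": True},
    {"name": "video_chat", "max_tiles": 1,  "repeat_factor": 0.5, "augment": False},
]

def effective_size(ds: dict, base_size: int) -> int:
    """repeat_factor > 1 oversamples a small dataset; < 1 downsamples a large one."""
    return int(base_size * ds["repeat_factor"])

print(effective_size(datasets[0], 10_000))  # 20000 samples seen per epoch
print(effective_size(datasets[1], 10_000))  # 5000 samples seen per epoch
```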
Guide: Running Locally
Basic Steps
- Model Loading: Load the model using the `transformers` library with the specified configurations (see the sketch after this list).
- Inference: Utilize the provided code examples for single-image, multi-image, and video inference.
- Fine-Tuning and Deployment: Refer to the documentation for fine-tuning options and deployment using LMDeploy.
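The snippet below is a minimal loading and inference sketch following the usual `transformers` pattern for this model family; the `load_image` preprocessing helper and the exact `chat()` signature come from the model repository's remote code, so treat this as a starting point rather than a complete example.

```python
# Minimal loading sketch for OpenGVLab/InternVL2_5-38B via transformers.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,   # pulls the InternVL modeling/chat code from the repo
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Single-image inference (assumed usage): images are tiled into 448x448 crops.
# `load_image` is the preprocessing helper shipped with the model card's examples.
# pixel_values = load_image("./example.jpg", max_num=12).to(torch.bfloat16).cuda()
# question = "<image>\nPlease describe the image shortly."
# response = model.chat(tokenizer, pixel_values, question,
#                       generation_config=dict(max_new_tokens=1024))
# print(response)
```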
Cloud GPUs
Running the model efficiently requires cloud GPUs. Consider using platforms like AWS, Google Cloud, or Azure with high-memory GPUs such as NVIDIA A100.
License
This project is released under the MIT License. It includes components like Qwen2.5-32B-Instruct, licensed under the Apache License 2.0.