InternVL2_5-78B

OpenGVLab

Introduction

InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds on the InternVL 2.0 architecture. It introduces significant enhancements in training and testing strategies as well as data quality, improving robustness across diverse multimodal tasks.

Architecture

InternVL 2.5 retains the "ViT-MLP-LLM" architecture of previous versions: a pre-trained InternViT vision encoder is connected to an LLM (e.g., InternLM 2.5 or Qwen 2.5) through an MLP projector. A pixel unshuffle operation reduces the number of visual tokens to one quarter, and the model supports multi-image and video inputs, with large images dynamically split into 448×448 pixel tiles.
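For intuition, the sketch below shows how pixel unshuffle trades spatial resolution of the ViT feature map for channel depth, cutting the visual token count by 4× before the MLP projector. Shapes are illustrative; the released code applies its own variant of this reshuffling.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a 448x448 tile with patch size 14 gives a 32x32 grid
# of ViT features (1024 visual tokens); the hidden size C is arbitrary here.
B, H, W, C = 1, 32, 32, 1024
vit_tokens = torch.randn(B, H * W, C)                  # (batch, 1024 tokens, C)

# Reshape to a 2-D feature map and pixel-unshuffle with factor 2:
# the spatial grid shrinks to 16x16 while channels grow 4x, so the LLM
# receives 256 visual tokens per tile instead of 1024.
feat = vit_tokens.transpose(1, 2).reshape(B, C, H, W)  # (B, C, 32, 32)
feat = F.pixel_unshuffle(feat, downscale_factor=2)     # (B, 4*C, 16, 16)
visual_tokens = feat.flatten(2).transpose(1, 2)        # (B, 256, 4*C)
print(visual_tokens.shape)                             # torch.Size([1, 256, 4096])
```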

Training

InternVL 2.5 extends dynamic high-resolution training to multi-image and video datasets. Training proceeds in three stages:

  1. MLP Warmup: Trains the MLP projector with dynamic high-resolution strategies.
  2. ViT Incremental Learning (Optional): Enhances the vision encoder's capacity, useful for rare domains.
  3. Full Model Instruction Tuning: Utilizes high-quality multimodal datasets for comprehensive training.

A progressive scaling strategy reduces training cost: the vision encoder is first aligned with smaller LLMs and then transferred to larger ones without retraining from scratch.

Training Enhancements

  • Random JPEG Compression: Enhances image robustness by simulating internet degradation.
  • Loss Reweighting: Balances loss across varied-length responses using square averaging.
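The sketch below illustrates both ideas: the JPEG augmentation is a standard PIL re-encode round trip, and the reweighting divides each response's summed token loss by the square root of its length, a middle ground between token averaging and sample averaging. The probability, quality range, and final batch normalization are assumptions, not the released training code.

```python
import io
import random

import torch
from PIL import Image


def random_jpeg_compression(img: Image.Image, p: float = 0.5) -> Image.Image:
    """Re-encode an image as JPEG with a random quality level to mimic the
    degradation of images transmitted over the internet. The probability and
    quality range used here are illustrative assumptions."""
    if random.random() < p:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=random.randint(75, 100))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img


def square_average_loss(per_token_losses: list[torch.Tensor]) -> torch.Tensor:
    """Square averaging: each response's token losses are summed and divided
    by sqrt(response length), so long answers neither dominate the batch
    (as with token averaging) nor have their tokens over-weighted
    (as with sample averaging). The final batch reduction is an assumption."""
    weighted = [losses.sum() / (losses.numel() ** 0.5) for losses in per_token_losses]
    return torch.stack(weighted).mean()
```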

Data Organization

Key parameters control data distribution to optimize training:

  • Data Augmentation: Conditional JPEG compression to ensure dataset robustness.
  • Maximum Tile Number: Governs tile allocation per dataset.
  • Repeat Factor: Adjusts sampling frequency to maintain balance.

A data filtering pipeline ensures high-quality samples by scoring, detecting repetition, and applying heuristic rules.
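As a purely hypothetical illustration of how these per-dataset knobs might be expressed (field names are invented for this sketch, not the project's actual configuration format):

```python
# Hypothetical per-dataset entries combining the three knobs described above.
data_mixture = [
    {"name": "document_vqa",    "max_tile_num": 12, "repeat_factor": 1.0, "jpeg_compression": True},
    {"name": "video_qa",        "max_tile_num": 1,  "repeat_factor": 0.5, "jpeg_compression": False},
    {"name": "rare_domain_ocr", "max_tile_num": 6,  "repeat_factor": 2.0, "jpeg_compression": True},
]
```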

Guide: Running Locally

To run InternVL2_5-78B locally, follow these steps:

  1. Install Dependencies:

    • Ensure transformers>=4.37.2 is installed.
  2. Load the Model:

    • Use PyTorch to load the model in BF16, or with 8-bit quantization to reduce memory (see the loading sketch after this list).
  3. GPU Configuration:

    • For multi-GPU setups, distribute the model's layers across GPUs to avoid cross-device errors (the loading sketch after this list shows one way to do this).
  4. Inference:

    • Prepare images or videos with appropriate preprocessing (resizing, tiling, and normalization).
    • Use the model's chat interface for multimodal or text-only queries (see the inference sketch after this list).
  5. Cloud GPUs:

    • Consider cloud services such as AWS or Google Cloud for access to high-memory GPUs; in BF16 the 78B weights alone occupy roughly 156 GB, i.e., several 80 GB cards.
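A minimal loading sketch with Hugging Face transformers, assuming the checkpoint id OpenGVLab/InternVL2_5-78B and its bundled remote code. device_map="auto" is a simple starting point for multi-GPU use; you may need to place layers manually if you still hit cross-device errors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-78B"

# BF16 weights spread across all visible GPUs; pass load_in_8bit=True
# instead of torch_dtype if memory is tight.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```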
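Continuing from the loading sketch, a simplified single-image query. The repository's own preprocessing tiles large images into multiple 448×448 crops plus a thumbnail, whereas this version feeds one resized tile; the ImageNet normalization constants and the chat() helper are assumptions based on the remote code shipped with the checkpoint.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Single-tile preprocessing (the repository's helper adds dynamic tiling).
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open("example.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=512, do_sample=False)

# Multimodal query: the <image> placeholder marks where visual tokens go.
question = "<image>\nDescribe this image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)

# Text-only query: omit the pixel values.
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```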

License

This project is released under the MIT License. It uses the pre-trained Qwen2.5-72B-Instruct model, which is licensed under the Qwen License.
