InternVL2_5-1B

OpenGVLab

Introduction

InternVL 2.5 is a series of multimodal large language models (MLLMs) that builds on InternVL 2.0, with improvements to training and testing strategies and to data quality. Each model pairs a vision encoder with a language model so that images and text can be processed together.

Architecture

InternVL 2.5 retains the "ViT-MLP-LLM" architecture from its predecessors, integrating a pre-trained InternViT with various LLMs such as InternLM 2.5 and Qwen 2.5 using a newly initialized MLP projector. Key architectural refinements include pixel unshuffle operations to reduce visual token count and improved support for multi-image and video data.
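
As a rough illustration of how pixel unshuffle reduces the visual token count, the sketch below folds each 2×2 block of neighboring tokens into the channel dimension, so the token grid shrinks to a quarter of its length while each token becomes four times wider. The function name `pixel_unshuffle`, the NumPy implementation, and the example sizes (a 32×32 grid of 8-dimensional tokens, the grid a 448×448 tile would give with 14×14 patches) are assumptions chosen for illustration; the channel ordering in the released model code may differ.

```python
import numpy as np


def pixel_unshuffle(tokens: np.ndarray, factor: int = 2) -> np.ndarray:
    """Fold each factor x factor block of spatial positions into channels.

    tokens: array of shape (H, W, C)
    returns: array of shape (H // factor, W // factor, C * factor * factor)
    """
    h, w, c = tokens.shape
    if h % factor or w % factor:
        raise ValueError("grid height and width must be divisible by factor")
    # Split both spatial axes into (block index, offset inside the block).
    x = tokens.reshape(h // factor, factor, w // factor, factor, c)
    # Bring the two in-block offsets next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    # Merge the offsets and channels into one wider channel axis.
    return x.reshape(h // factor, w // factor, factor * factor * c)


# Illustrative sizes only: 32 x 32 = 1024 tokens, 8 channels each.
grid = np.arange(32 * 32 * 8).reshape(32, 32, 8)
merged = pixel_unshuffle(grid)
print(merged.shape)                        # (16, 16, 32)
print(merged.shape[0] * merged.shape[1])   # 256
```

Under these assumed sizes, the 1,024 tokens of one tile become 256, which is the sequence length the language model actually has to attend over.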

Training

The training strategy combines dynamic high-resolution processing for multimodal datasets, a three-stage training pipeline that strengthens visual perception and multimodal capabilities, and a progressive scaling strategy that aligns the vision encoder with different LLMs. Random JPEG compression is applied to improve robustness to degraded images, and loss reweighting balances the contribution of responses of different lengths.
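
Dynamic high-resolution processing can be pictured as choosing a grid of fixed-size tiles whose shape matches the input image, then resizing and cutting the image to that grid. The sketch below covers only the grid choice. The names `choose_tile_grid` and `resized_shape`, the 448-pixel tile size, the 12-tile cap, and the tie-breaking rule (prefer fewer tiles) are illustrative assumptions, not the released preprocessing code.

```python
from typing import Tuple

TILE_SIZE = 448  # assumed tile edge length in pixels


def choose_tile_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Return (cols, rows) of the tile grid closest to the image's aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if max_tiles < 1:
        raise ValueError("max_tiles must be at least 1")
    aspect = width / height
    # Every grid whose total tile count stays within the budget.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with fewer tiles
    # (a simplification made for this sketch).
    return min(
        candidates,
        key=lambda grid: (abs(aspect - grid[0] / grid[1]), grid[0] * grid[1]),
    )


def resized_shape(grid: Tuple[int, int]) -> Tuple[int, int]:
    """Pixel size (width, height) the image is resized to before cutting tiles."""
    cols, rows = grid
    return cols * TILE_SIZE, rows * TILE_SIZE


print(choose_tile_grid(896, 448))   # (2, 1)
print(resized_shape((2, 1)))        # (896, 448)
```

A wide image therefore gets more columns than rows, and a tall one the reverse, so less of the image is distorted by resizing than with a single square input.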

Guide: Running Locally

  1. Install Dependencies:

    • Install PyTorch and Hugging Face Transformers.
    • Ensure the installed version satisfies transformers>=4.37.2.
  2. Model Loading:

    • Load the model in 16-bit precision (bfloat16) or with 8-bit quantization, and enable low-CPU-memory loading to reduce peak RAM use.
  3. Multi-GPU Configuration:

    • Use a device map to place model layers across multiple GPUs. Keeping the modules that receive the input and produce the final output on the same GPU prevents device mismatch errors.
  4. Inference:

    • Utilize dynamic preprocessing for images and videos.
    • Conduct image-text interactions and video analysis using the model's chat interface.
  5. Cloud GPU Suggestion:

    • Consider using cloud GPU services like AWS, Google Cloud, or Azure for efficient training and inference.
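
Steps 2 and 3 above can be sketched roughly as follows. `build_device_map` is a hypothetical helper that spreads the language model's layers over the available GPUs in contiguous chunks and pins everything else to GPU 0, so the modules that receive the input and produce the output share a device. The repository id, the module names in `PINNED_MODULES` and `LAYER_PREFIX`, and the layer count in the usage comment are assumptions; check them against `model.named_modules()` and the model config before relying on them. `load_model` uses only standard `from_pretrained` arguments, and its 8-bit path assumes the `bitsandbytes` package is installed.

```python
import math
from typing import Dict, Optional, Sequence

MODEL_ID = "OpenGVLab/InternVL2_5-1B"  # assumption: repo id inferred from the title

# Assumed module names; verify with model.named_modules().
LAYER_PREFIX = "language_model.model.layers"
PINNED_MODULES = (
    "vision_model",
    "mlp1",
    "language_model.model.embed_tokens",
    "language_model.model.norm",
    "language_model.model.rotary_emb",
    "language_model.lm_head",
)


def build_device_map(
    num_layers: int,
    num_gpus: int,
    layer_prefix: str = LAYER_PREFIX,
    pinned_modules: Sequence[str] = PINNED_MODULES,
) -> Dict[str, int]:
    """Map LLM layers to GPUs in contiguous chunks; pin the rest to GPU 0."""
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")
    per_gpu = math.ceil(num_layers / num_gpus)
    device_map = {name: 0 for name in pinned_modules}
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = i // per_gpu
    return device_map


def load_model(device_map: Optional[Dict[str, int]] = None, use_8bit: bool = False):
    """Load model and tokenizer in bfloat16, optionally 8-bit or sharded."""
    # Imported here so the helper above works without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

    kwargs = {
        "torch_dtype": torch.bfloat16,
        "low_cpu_mem_usage": True,
        "trust_remote_code": True,  # the model ships custom modeling code
    }
    if use_8bit:
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    if device_map is not None:
        kwargs["device_map"] = device_map
    # Without a device_map or quantization the weights stay on CPU;
    # call model.cuda() afterwards for single-GPU use.
    model = AutoModel.from_pretrained(MODEL_ID, **kwargs).eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    return model, tokenizer


# Usage (downloads weights, so left as a comment; 24 layers is an assumption):
# model, tokenizer = load_model(device_map=build_device_map(24, num_gpus=2))
```

An even split is the simplest choice. Because GPU 0 also hosts the vision encoder, a real setup may want to give it fewer language model layers than the other devices.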

License

The InternVL 2.5 series is released under the MIT License. Components such as the pre-trained Qwen2.5-0.5B-Instruct are licensed under the Apache License 2.0.
