InternVL2_5-8B

OpenGVLab

Introduction

InternVL 2.5 is an advanced multimodal large language model (MLLM) series, enhancing previous versions by improving training and testing strategies and data quality. It builds on the architecture of InternVL 2.0 and incorporates significant advancements in handling multimodal data, including images and videos.

Architecture

InternVL 2.5 retains the "ViT-MLP-LLM" architecture, coupling a pre-trained InternViT vision encoder with LLMs such as InternLM 2.5 and Qwen 2.5 through an MLP projector. A pixel unshuffle operation reduces the number of visual tokens to 256 per 448×448 image tile (illustrated below), and a dynamic-resolution strategy splits input images into such tiles. The architecture also accepts multi-image and video inputs, extending the model's flexibility across data types.
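To make the token reduction concrete, the sketch below (an illustration, not InternVL's actual module) shows how pixel unshuffle folds each 2×2 neighborhood of ViT tokens into the channel dimension, turning the 32×32 token grid of a 448×448 tile into 16×16 = 256 tokens.

    import torch
    import torch.nn.functional as F

    # A 448x448 tile encoded with patch size 14 yields a 32x32 grid of visual tokens.
    B, C, H, W = 1, 1024, 32, 32          # C: ViT hidden size (illustrative value)
    vit_tokens = torch.randn(B, C, H, W)

    # Pixel unshuffle with downscale factor 2 folds each 2x2 token neighborhood
    # into the channel axis: 1024 tokens -> 256 tokens with 4x the channels,
    # which the MLP projector then maps to the LLM embedding width.
    merged = F.pixel_unshuffle(vit_tokens, downscale_factor=2)
    print(merged.shape)  # torch.Size([1, 4096, 16, 16])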

Training

InternVL 2.5 employs a dynamic high-resolution training strategy to accommodate multi-image and video datasets. The training pipeline is divided into stages:

  • Stage 1: MLP warmup with frozen vision and language models.
  • Stage 1.5: Optional incremental learning for the vision encoder.
  • Stage 2: Full model instruction tuning on high-quality datasets.

Progressive scaling is used to efficiently align the vision encoder with LLMs, minimizing redundancy and maximizing component reuse. Training enhancements include random JPEG compression and loss reweighting to improve model robustness and balance.
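The JPEG augmentation can be pictured as re-encoding training images at a randomly chosen quality level so the model sees realistic compression artifacts. The helper below is a generic sketch of this idea; the function name and quality range are assumptions, not the exact settings used in training.

    import io
    import random
    from PIL import Image

    def random_jpeg_compression(img: Image.Image, quality_range=(75, 100)) -> Image.Image:
        # Re-encode the image as JPEG at a random quality level to mimic the
        # compression artifacts common in web images (range is an assumption).
        quality = random.randint(*quality_range)
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        return Image.open(buf).convert("RGB")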

Guide: Running Locally

Basic Steps

  1. Install Required Libraries: Ensure transformers>=4.37.2 is installed, along with torch and, if you enable flash attention, flash-attn.
  2. Model Loading: Use AutoTokenizer and AutoModel from the transformers library to load the tokenizer and model.
    import torch
    from transformers import AutoTokenizer, AutoModel
    path = "OpenGVLab/InternVL2_5-8B"
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=True,  # requires flash-attn; set to False if it is not installed
        trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
    
  3. Inference: Prepare your inputs and run the model, handling single or multiple GPUs as needed; a minimal single-image example is sketched after this list.
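For single-image inference, the model exposes a chat-style interface through its remote code. The snippet below is a minimal sketch assuming that chat() method together with the model and tokenizer loaded in step 2; the dummy pixel_values tensor stands in for the 448×448 tiles that the model card's preprocessing helper would produce from a real image.

    # Minimal single-image chat sketch (assumes `model` and `tokenizer` from step 2).
    # A dummy tile stands in for real preprocessed image data; in practice, use the
    # model card's image-loading helper to split an image into 448x448 tiles.
    pixel_values = torch.rand(1, 3, 448, 448, dtype=torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=1024, do_sample=False)

    question = "<image>\nDescribe this image in detail."
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(response)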

Cloud GPUs

Cloud GPUs, such as NVIDIA-equipped AWS EC2 instances or Google Cloud GPU instances, provide the memory and throughput that large models like InternVL 2.5 require; in bfloat16, the 8B model needs roughly 16 GB of GPU memory for the weights alone, plus headroom for activations and the KV cache.

License

InternVL 2.5 is released under the MIT License. It incorporates components like the pre-trained internlm2_5-7b-chat, which is licensed under the Apache License 2.0.
