Mini-InternVL-Chat-2B-V1-5

OpenGVLab

Introduction

The Mini-InternVL-Chat series introduces compact multimodal large language models (MLLMs) inspired by advances in small language models such as Gemma-2B and InternLM2-1.8B. Its vision encoder is distilled from the larger InternViT-6B-448px-V1-5, and it is paired with InternLM2-Chat-1.8B or Phi-3-mini-128k-instruct as the language model. The result is a high-performing, cost-effective model.

Architecture

The Mini-InternVL-Chat-2B-V1-5 model features:

  • Architecture: InternViT-300M-448px vision encoder, an MLP projector, and InternLM2-Chat-1.8B as the language model.
  • Resolution: dynamic, up to 40 tiles of 448 × 448 pixels (see the tiling sketch below).
  • Parameters: approximately 2.2 billion.
  • Design: derived from InternVL 1.5, with a simplified structure.
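
The tile grid is chosen to match the input image's aspect ratio. The helper below is a hypothetical sketch of that selection (choose_tile_grid is not part of the released code, and InternVL 1.5's actual preprocessing additionally appends a thumbnail tile):

    def choose_tile_grid(width, height, max_tiles=40):
        # Pick a (cols, rows) grid of 448x448 tiles whose aspect ratio
        # best matches the image without exceeding the tile budget.
        best, best_diff = (1, 1), float("inf")
        for cols in range(1, max_tiles + 1):
            for rows in range(1, max_tiles // cols + 1):
                diff = abs(width / height - cols / rows)
                if diff < best_diff:
                    best, best_diff = (cols, rows), diff
        return best

    # A 1920x1080 image maps to a 7x4 grid: 28 tiles, each 448x448.
    print(choose_tile_grid(1920, 1080))  # (7, 4)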

Training

The model was trained on the same dataset as InternVL 1.5, with an 8K context length made feasible by the model's reduced size. Training proceeds in two stages, pre-training and fine-tuning, with the ViT, MLP, and LLM as the learnable components.
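
As a rough illustration of the staged setup (a hypothetical sketch, assuming pre-training updates the ViT and MLP while fine-tuning updates all three components; the attribute names vision_model, mlp1, and language_model follow the InternVL checkpoint layout but should be verified on the loaded model):

    def set_trainable(model, stage):
        # Freeze everything, then re-enable gradients for the modules
        # that are learnable in the given stage.
        for p in model.parameters():
            p.requires_grad = False
        stage_modules = {
            "pretrain": [model.vision_model, model.mlp1],
            "finetune": [model.vision_model, model.mlp1, model.language_model],
        }
        for module in stage_modules[stage]:
            for p in module.parameters():
                p.requires_grad = True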

Guide: Running Locally

To run Mini-InternVL-Chat-2B-V1-5 locally, ensure you have transformers>=4.37.2. Here are the basic steps:

  1. Install Dependencies:
    Ensure you have the required Python libraries, particularly transformers.
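
    For example (transformers>=4.37.2 is the stated requirement; torch, torchvision, timm, and einops are assumptions based on typical InternVL setups):

    pip install "transformers>=4.37.2" torch torchvision timm einops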

  2. Load the Model:
    Use AutoModel from the transformers library with trust_remote_code=True, since the checkpoint ships custom modeling code. The model supports 16-bit (bfloat16), 8-bit, and 4-bit loading.

    import torch
    from transformers import AutoModel
    model = AutoModel.from_pretrained("OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
        torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
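
    For 8-bit loading, for example (a sketch; it assumes the bitsandbytes package is installed and a CUDA GPU is available):

    import torch
    from transformers import AutoModel

    # 8-bit weights via bitsandbytes; pass load_in_4bit=True instead for 4-bit.
    # Device placement is handled automatically, so no .cuda() call is needed.
    model = AutoModel.from_pretrained(
        "OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
        torch_dtype=torch.bfloat16,
        load_in_8bit=True,
        trust_remote_code=True,
    ).eval()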
    
  3. Inference:
    Prepare your inputs (text, images, or video) and use the model to generate responses.

    # pixel_values: preprocessed image tensor; question: the text prompt
    response = model.chat(tokenizer, pixel_values, question, generation_config)
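
    A fuller sketch using single-tile preprocessing (load_image is a hypothetical helper; the model's own pipeline slices the image into up to 40 dynamic tiles):

    import torch
    import torchvision.transforms as T
    from PIL import Image
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "OpenGVLab/Mini-InternVL-Chat-2B-V1-5", trust_remote_code=True, use_fast=False)

    def load_image(path, size=448):
        # One 448x448 tile with ImageNet normalization; illustrative only.
        transform = T.Compose([
            T.Resize((size, size)),
            T.ToTensor(),
            T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ])
        return transform(Image.open(path).convert("RGB")).unsqueeze(0)

    pixel_values = load_image("example.jpg").to(torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=512, do_sample=False)
    question = "<image>\nDescribe this image in detail."
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(response)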
    
  4. Multi-GPU Setup:
    If using multiple GPUs, ensure the model layers are distributed correctly across devices to avoid errors.
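
    One simple approach is automatic sharding (a sketch; it assumes the accelerate package is installed, and a handcrafted device_map that keeps related layers on the same device gives finer control when automatic placement fails):

    import torch
    from transformers import AutoModel

    # Shard the model's layers across all visible GPUs automatically.
    model = AutoModel.from_pretrained(
        "OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    ).eval()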

Cloud GPUs such as those from AWS, Google Cloud, or Azure are recommended for handling larger workloads and speeding up processing.

License

This project is licensed under the MIT License. It incorporates the pre-trained internlm2-chat-1_8b, which is licensed under the Apache License 2.0.