InternVL2_5-8B-MPO

OpenGVLab

Introduction

InternVL2.5-MPO is a series of multimodal large language models (MLLMs) that builds on InternVL2.5 and is trained with Mixed Preference Optimization (MPO). The models process visual and textual inputs together and are designed for strong performance across a range of multimodal tasks.

Architecture

InternVL2.5-MPO retains the "ViT-MLP-LLM" architecture paradigm of its predecessors. It connects a pre-trained InternViT vision encoder to a large language model such as InternLM 2.5 or Qwen 2.5 through a randomly initialized MLP projector. A pixel unshuffle operation reduces the number of visual tokens to one quarter, and a dynamic resolution strategy splits each image into tiles of 448×448 pixels. Multi-image and video inputs are also supported.
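As a rough illustration of how tiling and pixel unshuffle determine the visual token count, the sketch below does the arithmetic in plain Python. The 14-pixel patch size and the 2x unshuffle factor are assumptions that this summary does not state, and both helper names are hypothetical.

```python
def tokens_per_tile(tile_size=448, patch_size=14, unshuffle_factor=2):
    """Visual tokens produced by one square tile.

    patch_size and unshuffle_factor are assumed values for illustration.
    """
    patches_per_side = tile_size // patch_size   # 448 // 14 = 32
    num_patches = patches_per_side ** 2          # 32 * 32 = 1024
    # Pixel unshuffle folds each factor x factor block of patch
    # embeddings into a single token with more channels.
    return num_patches // (unshuffle_factor ** 2)


def tokens_for_image(num_tiles, **kwargs):
    """Total visual tokens for an image that was split into num_tiles tiles."""
    return num_tiles * tokens_per_tile(**kwargs)


print(tokens_per_tile())      # 256
print(tokens_for_image(6))    # 1536
```

Under these assumptions, each additional tile adds a fixed number of tokens, so the dynamic resolution strategy trades image detail against context length.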

Training

The training process uses Mixed Preference Optimization (MPO), which combines preference loss, quality loss, and generation loss. This approach helps the model learn relative preferences between responses, the absolute quality of individual responses, and the generation process for preferred responses. The model is trained on the Multi-Modal Preference Dataset (MMPR) comprising about 3 million samples.
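The combination of the three terms can be sketched as a weighted sum. The concrete forms used here (a DPO-style preference term, a BCO-style quality term, and a length-normalized negative log-likelihood for generation), the weights, and the values of `beta` and `delta` are assumptions for illustration; the description above only states that three losses are combined. Inputs are sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1):
    # Relative preference: chosen should gain more over the reference than rejected.
    margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
    return -log_sigmoid(margin)


def quality_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1, delta=0.0):
    # Absolute quality: score each response on its own against a shift delta.
    pos = -log_sigmoid(beta * (pol_c - ref_c) - delta)
    neg = -log_sigmoid(-(beta * (pol_r - ref_r) - delta))
    return pos + neg


def generation_loss(pol_c, num_tokens):
    # Length-normalized negative log-likelihood of the chosen response.
    return -pol_c / num_tokens


def mpo_loss(pol_c, pol_r, ref_c, ref_r, num_tokens,
             w_p=1.0, w_q=1.0, w_g=1.0):
    # Hypothetical weights; the actual values are not given in this document.
    return (w_p * preference_loss(pol_c, pol_r, ref_c, ref_r)
            + w_q * quality_loss(pol_c, pol_r, ref_c, ref_r)
            + w_g * generation_loss(pol_c, num_tokens))


print(mpo_loss(-12.0, -15.0, -12.0, -15.0, num_tokens=6))
```

When the policy matches the reference, the preference and quality terms sit at their neutral values and only the generation term depends on the response itself.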

Guide: Running Locally

  1. Environment Setup:

    • Install the transformers library version 4.37.2 or higher.
    • Ensure you have PyTorch with GPU support.
  2. Model Loading:

    • Load the model with AutoModel.from_pretrained, either in 16-bit precision (torch_dtype=torch.bfloat16) or with 8-bit quantization (load_in_8bit=True).
  3. Inference:

    • Use the inference scripts provided with the model for single-image, multi-image, and video inputs.
    • For multi-GPU setups, set the device map so that tensors used together sit on the same device; otherwise inference can fail with device-mismatch errors.
  4. Deployment:

    • Use LMDeploy to serve the model through a RESTful API.
    • If no suitable local GPU is available, cloud GPU services such as AWS, GCP, or Azure are an option.
  5. Example Code:

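    import torch
    from transformers import AutoModel

    path = "OpenGVLab/InternVL2_5-8B-MPO"

    # Load the weights in bfloat16 and move the model to the GPU.
    # trust_remote_code=True is required because the model class is
    # defined in the model repository rather than in transformers.
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True).eval().cuda()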
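
For the multi-GPU case mentioned in the inference step, a device map can be built by hand. The sketch below is one possible layout, not a definitive one: it counts GPU 0 as half a GPU because it also hosts the vision encoder, and it pins the last layer to GPU 0 so the output lands next to the embeddings. The module names and the layer count of 32 are assumptions about the model's remote code, and `build_device_map` is a hypothetical helper.

```python
import math


def build_device_map(num_layers=32, world_size=2):
    """Assign language-model layers to GPUs; GPU 0 gets a half share."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    layers_per_gpu = [per_gpu] * world_size
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)

    device_map = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu):
        for _ in range(count):
            if layer >= num_layers:  # do not create entries for missing layers
                break
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1

    # Assumed module names: keep the vision side and the input/output ends on GPU 0.
    for name in ("vision_model", "mlp1",
                 "language_model.model.tok_embeddings",
                 "language_model.model.norm",
                 "language_model.output"):
        device_map[name] = 0
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
    return device_map


device_map = build_device_map(num_layers=32, world_size=2)
print(device_map["language_model.model.layers.11"])  # 1
```

The resulting dictionary would be passed as device_map=device_map to AutoModel.from_pretrained, in place of the trailing .cuda() call.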
    

License

This project is released under the MIT License. It incorporates the pre-trained internlm2_5-7b-chat, licensed under the Apache License 2.0.
