InternVL2_5-38B-MPO

Introduction

InternVL2.5-MPO is a series of multimodal large language models (MLLMs) that builds on InternVL2.5 by applying Mixed Preference Optimization (MPO). It is described as offering superior overall performance across a range of multimodal tasks.

Architecture

InternVL2.5-MPO retains the architecture of InternVL2.5 and its predecessors, following the "ViT-MLP-LLM" paradigm. It connects a newly pre-trained InternViT vision encoder to various pre-trained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, through a randomly initialized MLP projector. A dynamic resolution strategy splits input images into tiles, a pixel unshuffle operation reduces the number of visual tokens, and the model also supports multi-image and video input.
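
To make the token reduction concrete, here is a minimal sketch of pixel unshuffle on a grid of patch features, written in plain Python. The 448-pixel tile size, 14-pixel patch size, and downsampling factor of 2 are assumptions chosen for illustration and are not stated above; the ordering of features inside each merged vector may also differ from the actual implementation.

```python
def pixel_unshuffle(grid, factor=2):
    """Fold each factor x factor block of feature vectors into one longer vector.

    grid is an H x W nested list of per-patch feature vectors (lists of numbers).
    The result is an (H // factor) x (W // factor) grid whose vectors are
    factor * factor times longer, so the token count drops by factor ** 2.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid sides must be divisible by factor")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            merged = []
            for dy in range(factor):
                for dx in range(factor):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out


def visual_tokens_per_tile(image_size=448, patch_size=14, factor=2):
    """Count the tokens left for one square tile after patching and unshuffling."""
    patches_per_side = image_size // patch_size  # assumed values, see lead-in
    return (patches_per_side // factor) ** 2
```

Under these assumptions, one tile yields 32 x 32 = 1024 patches before the operation and 16 x 16 = 256 visual tokens after it, a fourfold reduction.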

Training

Training uses the Multi-Modal Preference Dataset (MMPR), a large-scale dataset of about 3 million samples. Mixed Preference Optimization (MPO) combines three terms: a preference loss, a quality loss, and a generation loss. Together they let the model learn the relative preference between pairs of responses, the absolute quality of individual responses, and the process for generating preferred responses.
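
The combination of the three losses can be sketched for a single preference pair as follows. This is an illustration under stated assumptions, not the project's training code: it assumes a DPO-style preference term, a BCO-style quality term, and a length-normalized log-likelihood for the generation term, none of which are specified in the text above. The function name `mpo_loss` and the default values of `beta`, `delta`, and the three weights are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             chosen_length, beta=0.1, delta=0.0,
             w_pref=1.0, w_qual=1.0, w_gen=1.0):
    """Weighted sum of preference, quality, and generation losses.

    The first four arguments are sequence log-probabilities of the chosen and
    rejected responses under the trained (policy) and frozen (reference) model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)

    # Preference loss (assumed DPO-style): chosen should out-score rejected.
    preference = -log_sigmoid(reward_chosen - reward_rejected)

    # Quality loss (assumed BCO-style): each response is judged on its own,
    # relative to a reward shift delta.
    quality = (-log_sigmoid(reward_chosen - delta)
               - log_sigmoid(-(reward_rejected - delta)))

    # Generation loss (assumed SFT-style): per-token negative log-likelihood
    # of the chosen response.
    generation = -policy_chosen / chosen_length

    return w_pref * preference + w_qual * quality + w_gen * generation
```

In this sketch, raising the policy's log-probability of the chosen response lowers all three terms, while raising it for the rejected response increases the preference and quality terms.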

Guide: Running Locally

To run the InternVL2_5-38B-MPO model locally:

Environment Setup:
- Ensure transformers>=4.37.2 is installed.
- Use at least two 80GB GPUs if not using 8-bit quantization.
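
The GPU requirement follows from simple arithmetic on the weight size. The sketch below is a back-of-the-envelope estimate: it takes the 38 billion parameter count from the model name as approximate, and the `headroom` fraction reserved for activations and runtime overhead is a hypothetical value.

```python
import math


def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, bits_per_param,
                         gpu_memory_gb=80, headroom=0.1):
    """Smallest GPU count whose usable memory can hold the weights.

    headroom is the fraction of each GPU kept free for activations, the
    KV cache, and framework overhead (an assumed, illustrative value).
    """
    usable_gb = gpu_memory_gb * (1 - headroom)
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / usable_gb)
```

With these assumptions, 16-bit weights take roughly 76 GB, which does not fit comfortably on one 80GB GPU, while 8-bit weights take roughly 38 GB and do.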

Model Loading:

Load the model with Hugging Face Transformers in bfloat16 precision, with reduced CPU memory usage during loading:

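import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2_5-38B-MPO"

# trust_remote_code is needed because the model class lives in the repository.
# .cuda() places the whole model on a single GPU; see Multi-GPU Setup to shard it.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

# Load the matching tokenizer from the same repository.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)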
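
For the 8-bit option mentioned under Environment Setup, a loading call might look like the sketch below. It assumes the `bitsandbytes` package is installed and uses the `BitsAndBytesConfig` class from Transformers; the wrapper function `load_quantized` is a hypothetical helper, and this configuration has not been verified against this particular checkpoint.

```python
def load_quantized(path="OpenGVLab/InternVL2_5-38B-MPO"):
    """Load the model with 8-bit weights (requires bitsandbytes and a GPU)."""
    # Imported lazily so this file can be imported on machines without
    # the GPU libraries installed.
    import torch
    from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True,
    ).eval()  # no .cuda(): quantized weights are placed on the GPU at load time
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    return model, tokenizer
```

Quantization trades some numerical precision for roughly half the weight memory of bfloat16.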

Multi-GPU Setup:
- Define a device map to distribute model layers across available GPUs.
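
One way to build such a device map is sketched below. The module names (`vision_model`, `mlp1`, `language_model.model.layers.N`, and so on) are assumptions about how the remote model code names its submodules, and the function `build_device_map` is a hypothetical helper; check the names against the loaded model before relying on them.

```python
import math


def build_device_map(num_layers, world_size):
    """Spread the language model's layers over world_size GPUs.

    GPU 0 also hosts the vision encoder and projector, so it is counted
    as half a GPU when dividing up the transformer layers.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    layers_per_gpu = [per_gpu] * world_size
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)

    device_map = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu):
        for _ in range(count):
            if layer >= num_layers:  # do not create entries for missing layers
                break
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1

    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    for name in ("vision_model", "mlp1",
                 "language_model.model.embed_tokens",
                 "language_model.model.norm",
                 "language_model.model.rotary_emb",
                 "language_model.lm_head"):
        device_map[name] = 0
    # The last layer goes back to GPU 0 so its output sits next to the head.
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
    return device_map
```

The resulting dictionary would be passed as `device_map=` to `AutoModel.from_pretrained` in place of the `.cuda()` call, with `world_size` taken from `torch.cuda.device_count()`. The layer count of the language model (assumed to be 64 for this checkpoint) should be read from the model configuration.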

Inference:
- Use the provided scripts to run inference on images or videos.
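
As a rough illustration of what an inference call looks like, the sketch below wraps the `chat` method that the model's remote code is assumed to expose, with the argument order `(tokenizer, pixel_values, question, generation_config)`. The helper `ask` is hypothetical, and the preprocessing that turns an image or video frames into `pixel_values` is not shown here.

```python
def ask(model, tokenizer, question, pixel_values=None, max_new_tokens=1024):
    """Send one question to the model, optionally with preprocessed image tensors.

    pixel_values=None requests a text-only conversation. When image tensors
    are given, an <image> placeholder is prepended if the question lacks one.
    """
    generation_config = dict(max_new_tokens=max_new_tokens, do_sample=True)
    if pixel_values is not None and "<image>" not in question:
        question = "<image>\n" + question
    # chat() is assumed to come from the model's remote code (see lead-in).
    return model.chat(tokenizer, pixel_values, question, generation_config)
```

A text-only call would then read `ask(model, tokenizer, "Hello, who are you?")`, using the `model` and `tokenizer` objects from the loading step.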

Cloud GPUs:
- If local hardware does not have enough GPU memory, consider cloud GPU services such as AWS, Google Cloud, or Azure, which offer GPUs suitable for running large models.

License

The InternVL2.5-MPO project is released under the MIT License. It uses the pre-trained Qwen2.5-32B-Instruct as a component, which is licensed under the Apache License 2.0.
