InternVL2_5-4B-MPO

Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that builds on InternVL2.5 and Mixed Preference Optimization (MPO) to improve overall performance. It integrates an incrementally pre-trained InternViT vision encoder with various pre-trained large language models, including InternLM 2.5 and Qwen 2.5, through a randomly initialized MLP projector.
Architecture
InternVL2.5-MPO retains the "ViT-MLP-LLM" architecture paradigm. It applies a pixel unshuffle operation that reduces the number of visual tokens to one quarter, and adopts a dynamic resolution strategy that divides input images into 448×448 pixel tiles. Multi-image and video inputs are also supported, extending the model's multimodal capabilities.
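To illustrate the dynamic resolution strategy, the sketch below splits an image into a grid of 448×448 tiles. It is a simplified stand-in for the actual preprocessing routine, which additionally searches candidate aspect ratios and appends a thumbnail tile; the function name and tile limits here are illustrative.

```python
from PIL import Image

def tile_image(img: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    """Split an image into a grid of tile_size x tile_size crops
    (simplified: no aspect-ratio search, no thumbnail tile)."""
    w, h = img.size
    # Choose a grid that roughly preserves the image's size, capped at max_tiles.
    cols = max(1, min(round(w / tile_size), max_tiles))
    rows = max(1, min(round(h / tile_size), max_tiles // cols))
    img = img.resize((cols * tile_size, rows * tile_size))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size,
                   (c + 1) * tile_size, (r + 1) * tile_size)
            tiles.append(img.crop(box))
    return tiles
```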
Training
The training process involves Mixed Preference Optimization (MPO), which combines preference loss, quality loss, and generation loss. This allows the model to learn the relative preferences between response pairs, the absolute quality of individual responses, and the generation process of preferred responses. A large-scale multimodal reasoning preference dataset (MMPR) with about 3 million samples is used for training.
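For intuition, here is a minimal sketch of how the three MPO terms can be combined, assuming a DPO-style preference loss, a BCO-style quality loss, and a standard SFT generation loss. The weights `w_p`, `w_q`, `w_g` and the reward-shift term `delta` are illustrative, not the published training configuration.

```python
import torch
import torch.nn.functional as F

def mpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             sft_nll, beta=0.1, w_p=0.8, w_q=0.2, w_g=1.0):
    """Combine the three MPO terms for one chosen/rejected response pair.

    The logp_* arguments are summed token log-probabilities under the policy
    or a frozen reference model; sft_nll is the next-token NLL on the chosen
    response. Weights and the reward shift are illustrative.
    """
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Preference loss (DPO-style): relative preference between the pair.
    l_pref = -F.logsigmoid(r_chosen - r_rejected)

    # Quality loss (BCO-style): absolute quality of each individual response,
    # centered by a reward shift (here a simple detached mean of the pair).
    delta = 0.5 * (r_chosen + r_rejected).detach()
    l_quality = (-F.logsigmoid(r_chosen - delta)
                 - F.logsigmoid(-(r_rejected - delta)))

    # Generation loss: standard SFT loss on the preferred response.
    l_gen = sft_nll

    return w_p * l_pref + w_q * l_quality + w_g * l_gen
```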
Guide: Running Locally
Basic Steps
- Model Loading: load the checkpoint in bfloat16 with FlashAttention enabled. Note that AutoTokenizer is also needed for the inference step:

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2_5-4B-MPO"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```
- Inference with Transformers: load and preprocess images or video frames, then use the model to generate responses to text or image inputs, as in the sketch below.
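A minimal single-image example, continuing from the loading step above. For brevity it resizes the image to a single 448×448 tile rather than tiling dynamically; the `chat` method is provided by the model's remote code, and the image path is hypothetical:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

# Continues from the loading step above (`model`, `tokenizer`).
# Single 448x448 tile for brevity; the full pipeline tiles dynamically.
transform = T.Compose([
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=1024, do_sample=True)
question = "<image>\nPlease describe the image briefly."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```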
- Multiple GPUs: distribute model layers across multiple GPUs to handle large-scale inference tasks efficiently, as sketched below.
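One simple way to shard the model, assuming the `accelerate` package is installed, is an automatic device map; an explicit per-layer device map that keeps the vision encoder on the first GPU is a common alternative:

```python
import torch
from transformers import AutoModel

# Let Accelerate shard layers across all visible GPUs automatically.
path = "OpenGVLab/InternVL2_5-4B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map="auto").eval()
```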
- Cloud GPUs: consider cloud GPU services such as AWS, Google Cloud, or Azure to access the computational resources needed for training and inference.
License
This project is released under the MIT License. It incorporates the pre-trained Qwen2.5-3B-Instruct model, which is licensed under the Apache License 2.0.