InternVL2_5-78B-MPO

OpenGVLab

Introduction

InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that builds on InternVL2.5 and applies Mixed Preference Optimization (MPO), demonstrating superior performance.

Architecture

InternVL2.5-MPO retains the architecture of its predecessors and integrates a newly pre-trained InternViT with various pre-trained language models like InternLM 2.5 and Qwen 2.5. The model employs a "ViT-MLP-LLM" paradigm and uses a pixel unshuffle operation to reduce visual tokens, supporting multi-image and video data.
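The token-reduction effect of pixel unshuffle can be illustrated with a minimal sketch. The tile size, patch size, and resulting 4x reduction (1024 to 256 tokens per tile) follow the common InternVL design and are assumptions here, not figures stated in this card:

```python
import torch
import torch.nn.functional as F

# A 448x448 image tile encoded by a ViT with patch size 14 yields a
# 32x32 grid of visual tokens. Pixel unshuffle with downscale factor 2
# folds each 2x2 spatial neighborhood into the channel dimension,
# cutting the token count by 4x before the MLP projector.
batch, channels, grid = 1, 1024, 32            # (B, C, H, W) ViT feature map
vit_features = torch.randn(batch, channels, grid, grid)

unshuffled = F.pixel_unshuffle(vit_features, downscale_factor=2)
print(unshuffled.shape)                         # torch.Size([1, 4096, 16, 16])

tokens_before = grid * grid                     # 1024 visual tokens
tokens_after = unshuffled.shape[-2] * unshuffled.shape[-1]
print(tokens_before, "->", tokens_after)        # 1024 -> 256
```

Fewer visual tokens per tile is what makes high-resolution multi-image and video inputs affordable in the LLM's context window.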

Training

The training process utilizes a large-scale multimodal reasoning preference dataset called MMPR and applies Mixed Preference Optimization. This involves learning preferences between response pairs, the absolute quality of responses, and the generation process for preferred responses. The training objective combines preference loss, quality loss, and generation loss using specific algorithms like DPO and BCO.
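The three loss terms can be sketched on dummy log-probabilities. The loss weights, the DPO temperature beta, and the BCO reward shift delta below are illustrative assumptions, not values from this card:

```python
import torch
import torch.nn.functional as F

beta = 0.1   # DPO temperature (assumed value)
delta = 0.0  # BCO reward shift; in practice a running mean of rewards

# Per-sequence log-probs of the chosen (preferred) and rejected responses
# under the policy and a frozen reference model (dummy values).
policy_chosen, policy_rejected = torch.tensor(-10.0), torch.tensor(-14.0)
ref_chosen, ref_rejected = torch.tensor(-11.0), torch.tensor(-13.0)

r_chosen = beta * (policy_chosen - ref_chosen)        # implicit reward, chosen
r_rejected = beta * (policy_rejected - ref_rejected)  # implicit reward, rejected

# Preference loss (DPO): learn to rank chosen above rejected.
loss_p = -F.logsigmoid(r_chosen - r_rejected)

# Quality loss (BCO): judge each response's absolute quality independently.
loss_q = -F.logsigmoid(r_chosen - delta) - F.logsigmoid(-(r_rejected - delta))

# Generation loss: standard NLL on the chosen response.
loss_g = -policy_chosen

# Weighted combination (weights are assumptions for illustration).
w_p, w_q, w_g = 0.8, 0.2, 1.0
loss = w_p * loss_p + w_q * loss_q + w_g * loss_g
print(float(loss))
```

The generation term keeps the model anchored to producing the preferred response itself, while the two preference terms shape relative and absolute judgments.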

Guide: Running Locally

Basic Steps

  1. Dependencies: Ensure transformers>=4.37.2 is installed.
  2. Model Loading: Load the model using AutoModel with configurations for 16-bit or 8-bit quantization.
  3. Multiple GPUs: Implement device mapping to distribute model layers across multiple GPUs.
  4. Inference: Use the provided code snippets to perform inference with images and videos.
  5. Streaming Output: Use TextIteratorStreamer for streamed output.

Cloud GPUs

For optimal performance, use cloud GPUs. If using 8-bit quantization, two 80GB GPUs are needed. Without it, at least three 80GB GPUs are required.
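A back-of-envelope estimate shows why these counts are plausible. The 78B parameter count is taken from the model name; the 20% overhead factor for activations and KV cache is a rough illustrative guess:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_gb: float = 80.0, overhead: float = 1.2) -> int:
    """Ceiling of (weight memory * overhead) / per-GPU memory."""
    total_gb = params_billion * bytes_per_param * overhead
    return math.ceil(total_gb / gpu_gb)

print(gpus_needed(78, 2))  # bf16 weights: 78B * 2 B * 1.2 ~ 187 GB -> 3 GPUs
print(gpus_needed(78, 1))  # int8 weights: 78B * 1 B * 1.2 ~  94 GB -> 2 GPUs
```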

License

This project is released under the MIT License and uses the pre-trained Qwen2.5-72B-Instruct, licensed under the Qwen License.
