InternVL2_5-8B-MPO
Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that builds on InternVL2.5 and Mixed Preference Optimization (MPO). It integrates visual and textual processing to deliver strong performance across a wide range of multimodal tasks.
Architecture
InternVL2.5-MPO retains the "ViT-MLP-LLM" architecture paradigm of its predecessors. It couples a pre-trained InternViT vision encoder with large language models such as InternLM 2.5 and Qwen 2.5 through a randomly initialized MLP projector. The model applies a pixel unshuffle operation to reduce the number of visual tokens and uses a dynamic resolution strategy that splits images into 448×448 pixel tiles. It also supports multi-image and video inputs.
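As a rough illustration of the visual token budget, the tile and token counts implied above can be computed directly. The sketch below assumes a ViT patch size of 14 and a 2×2 pixel unshuffle, as described in the InternVL reports; the function names are illustrative, not part of the model's API.

```python
def visual_tokens_per_tile(tile_size=448, patch_size=14, unshuffle_factor=2):
    # ViT patch grid for one tile (448 / 14 = 32 patches per side)
    patches_per_side = tile_size // patch_size
    # pixel unshuffle merges unshuffle_factor x unshuffle_factor
    # neighbouring patches into a single visual token (32 -> 16 per side)
    tokens_per_side = patches_per_side // unshuffle_factor
    return tokens_per_side * tokens_per_side

def total_visual_tokens(num_tiles):
    # every 448x448 tile contributes the same number of tokens
    return num_tiles * visual_tokens_per_tile()

print(visual_tokens_per_tile())  # 256 tokens per 448x448 tile
print(total_visual_tokens(7))    # 1792, e.g. 6 tiles plus a thumbnail
```

Under these assumptions, each tile costs 256 tokens, so the dynamic-resolution tile count directly controls the sequence length fed to the LLM.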
Training
The training process uses Mixed Preference Optimization (MPO), which combines preference loss, quality loss, and generation loss. This approach helps the model learn relative preferences between responses, the absolute quality of individual responses, and the generation process for preferred responses. The model is trained on the Multi-Modal Preference Dataset (MMPR) comprising about 3 million samples.
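The combination of the three objectives can be sketched with scalar toy values. The snippet below is a simplified illustration, not the official implementation: it uses a DPO-style preference term, a BCO-style quality term, and a language-modeling generation term, with placeholder weights. The real MPO loss operates on token-level log-probabilities; see the MPO paper for the exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(policy_margin, ref_margin, beta=0.1):
    # DPO-style preference term: reward the chosen response relative to the
    # rejected one, measured against a frozen reference model
    return -math.log(sigmoid(beta * (policy_margin - ref_margin)))

def quality_loss(reward_chosen, reward_rejected):
    # BCO-style quality term: push the absolute reward of the chosen
    # response up and that of the rejected response down
    return -math.log(sigmoid(reward_chosen)) - math.log(sigmoid(-reward_rejected))

def generation_loss(logp_chosen, num_tokens):
    # standard language-modeling (SFT) term on the preferred response
    return -logp_chosen / num_tokens

def mpo_loss(policy_margin, ref_margin, reward_chosen, reward_rejected,
             logp_chosen, num_tokens, w_p=1.0, w_q=1.0, w_g=1.0):
    # weighted sum of the three objectives (weights here are placeholders)
    return (w_p * preference_loss(policy_margin, ref_margin)
            + w_q * quality_loss(reward_chosen, reward_rejected)
            + w_g * generation_loss(logp_chosen, num_tokens))
```

A larger policy margin over the reference lowers the preference term, which is the mechanism by which the model learns relative preferences between responses.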
Guide: Running Locally
- Environment Setup:
  - Install the transformers library, version 4.37.2 or higher.
  - Ensure you have PyTorch with GPU support.
- Model Loading:
  - Load the model using AutoModel.from_pretrained with options for 16-bit or 8-bit quantization.
- Inference:
  - Use the provided scripts to perform inference, including single-image, multi-image, and video processing.
  - For multi-GPU setups, adjust device mappings to avoid inference errors.
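One common way to avoid cross-device errors is to build an explicit device map that keeps the vision encoder, the projector, the embeddings, and the first and last LLM layers on GPU 0. The sketch below mirrors the approach used in the official InternVL examples, but the module names and the GPU-0 weighting are assumptions that may need adjusting for your checkpoint.

```python
import math

def split_model(num_layers, num_gpus):
    # Illustrative device map: spread the LLM layers across all GPUs while
    # GPU 0 also hosts the ViT, so treat GPU 0 as half a GPU when sharing.
    device_map = {}
    per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
    counts = [per_gpu] * num_gpus
    counts[0] = math.ceil(per_gpu * 0.5)
    layer = 0
    for gpu, n in enumerate(counts):
        for _ in range(n):
            if layer >= num_layers:
                break
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1
    # vision tower, projector, embeddings, and output head stay on GPU 0
    device_map["vision_model"] = 0
    device_map["mlp1"] = 0
    device_map["language_model.model.embed_tokens"] = 0
    device_map["language_model.model.norm"] = 0
    device_map["language_model.lm_head"] = 0
    # keep the last layer with the lm_head to avoid device-mismatch errors
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
    return device_map
```

The resulting dict can then be passed as device_map=split_model(num_layers, num_gpus) to AutoModel.from_pretrained; check your checkpoint's config for the actual layer count.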
- Deployment:
  - Use lmdeploy for serving the model with RESTful APIs.
  - Consider cloud GPU services like AWS, GCP, or Azure for optimal performance.
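A minimal serving command might look like the following, assuming lmdeploy is installed (pip install lmdeploy); the port is an arbitrary example, and the current flags should be checked against the lmdeploy documentation.

```shell
# Serve the model behind an OpenAI-compatible RESTful API
lmdeploy serve api_server OpenGVLab/InternVL2_5-8B-MPO --server-port 23333
```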
- Example Code:
```python
import torch
from transformers import AutoModel

path = "OpenGVLab/InternVL2_5-8B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```
License
This project is released under the MIT License. It incorporates the pre-trained internlm2_5-7b-chat model, which is licensed under the Apache License 2.0.