InternVL2_5-78B-MPO
OpenGVLab

Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that builds upon InternVL2.5 and Mixed Preference Optimization (MPO), demonstrating superior overall performance.
Architecture
InternVL2.5-MPO retains the architecture of its predecessors, integrating a newly pre-trained InternViT vision encoder with various pre-trained language models such as InternLM 2.5 and Qwen 2.5. The model follows the "ViT-MLP-LLM" paradigm and applies a pixel unshuffle operation to reduce the number of visual tokens; it also supports multi-image and video inputs.
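For intuition, the pixel unshuffle step trades spatial resolution for channel depth: folding each 2×2 neighborhood of visual tokens into the channel dimension cuts the token count to one quarter before the MLP projector. A minimal sketch (the function name and the 0.5 scale factor mirror the public InternVL code, but treat the details as illustrative):

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """Fold 2x2 blocks of visual tokens into channels: (N, H, W, C) -> (N, H/2, W/2, 4C)."""
    n, h, w, c = x.shape
    # (N, H, W*scale, C/scale): merge each pair of neighboring columns into channels.
    x = x.reshape(n, h, int(w * scale_factor), int(c / scale_factor))
    # Swap the two spatial axes so rows can be paired the same way.
    x = x.permute(0, 2, 1, 3).contiguous()
    # (N, W*scale, H*scale, C/scale^2), then restore (N, H', W', C') order.
    x = x.reshape(n, int(w * scale_factor), int(h * scale_factor), int(c / scale_factor ** 2))
    return x.permute(0, 2, 1, 3).contiguous()

# A 448x448 tile encoded as a 32x32 grid of 1024-d tokens becomes 16x16 tokens of 4096-d:
tokens = torch.randn(1, 32, 32, 1024)
print(pixel_unshuffle(tokens).shape)  # torch.Size([1, 16, 16, 4096])
```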
Training
The training process uses a large-scale multimodal reasoning preference dataset called MMPR and applies Mixed Preference Optimization. MPO jointly learns the relative preference between response pairs, the absolute quality of individual responses, and the generation process for preferred responses. The training objective therefore combines a preference loss (computed with DPO), a quality loss (computed with BCO), and a generation loss (the standard language-modeling objective).
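A compact sketch of how these three terms could be combined, assuming DPO-style implicit rewards (log-probability ratios against a frozen reference model, scaled by β). The weight values, the δ reward shift, and the function signature here are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def mpo_loss(logp_chosen, logp_rejected,          # policy log-probs of each response
             ref_logp_chosen, ref_logp_rejected,  # frozen reference-model log-probs
             nll_chosen,                          # next-token NLL on the chosen response
             beta=0.1, delta=0.0,                 # reward scale and shift (assumed values)
             w_p=0.8, w_q=0.2, w_g=1.0):          # per-term weights (assumed values)
    # Implicit rewards, as in DPO: scaled log-ratio against the reference model.
    r_c = beta * (logp_chosen - ref_logp_chosen)
    r_r = beta * (logp_rejected - ref_logp_rejected)
    # Preference loss (DPO): the chosen response should out-score the rejected one.
    loss_p = -F.logsigmoid(r_c - r_r)
    # Quality loss (BCO): score each response's absolute quality against the shift delta.
    loss_q = -F.logsigmoid(r_c - delta) - F.logsigmoid(-(r_r - delta))
    # Generation loss: plain SFT objective on the preferred response.
    loss_g = nll_chosen
    return w_p * loss_p + w_q * loss_q + w_g * loss_g
```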
Guide: Running Locally
Basic Steps
- Dependencies: Ensure `transformers>=4.37.2` is installed.
- Model Loading: Load the model using `AutoModel`, with configurations for 16-bit or 8-bit quantization (see the loading sketch after this list).
- Multiple GPUs: Implement device mapping to distribute model layers across multiple GPUs (a device-map sketch follows below).
- Inference: Use the provided code snippets to perform inference with images and videos (an image-chat sketch follows below).
- Streaming Output: Use `TextIteratorStreamer` for streamed output (see the streaming sketch below).
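A minimal loading sketch, following the common InternVL recipe; `use_flash_attn` is a kwarg consumed by the model's remote code, and the 8-bit path is the optional quantized variant:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-78B-MPO"
# 16-bit (bf16) loading; add load_in_8bit=True for the 8-bit quantized variant.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map="auto",  # or a custom map; see the multi-GPU sketch below
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```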
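For multiple GPUs, a custom device map can pin the vision tower and embeddings to the first GPU and spread the LLM layers across the rest. This is a simplified sketch of that idea; the module-name keys follow InternVL's remote-code naming (e.g. `language_model.model.layers`), so verify them against the checkpoint you actually load:

```python
import math
import torch

def split_model(num_layers: int) -> dict:
    """Build a device map that keeps the ViT on GPU 0 and balances LLM layers."""
    world_size = torch.cuda.device_count()
    # Treat GPU 0 as half-capacity for LLM layers, since it also hosts the ViT.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [math.ceil(per_gpu * 0.5)] + [per_gpu] * (world_size - 1)

    device_map, layer_idx = {}, 0
    for gpu, count in enumerate(counts):
        for _ in range(count):
            if layer_idx >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer_idx}'] = gpu
            layer_idx += 1
    # Keep the vision tower, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    # Place the last layer on GPU 0 as well, next to the lm_head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

# Qwen2.5-72B has 80 transformer layers; pass the map at load time:
# model = AutoModel.from_pretrained(path, device_map=split_model(80), ...)
```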
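Single-image chat and streaming can then be sketched as follows. `model.chat` and `load_image` are the helpers from InternVL's published snippets (the image path is a placeholder), while `TextIteratorStreamer` is standard `transformers`:

```python
import torch
from threading import Thread
from transformers import TextIteratorStreamer

# `load_image` is the tiling/preprocessing helper from the model card's snippets;
# it returns pixel values for one image (the path below is a placeholder).
pixel_values = load_image('./example.jpg', max_num=12).to(torch.bfloat16).cuda()

# Plain (non-streaming) single-image inference via the remote-code chat helper.
generation_config = dict(max_new_tokens=1024, do_sample=True)
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)

# Streaming: run generation in a background thread and read tokens as they arrive.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                skip_special_tokens=True, timeout=10)
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values,
    question=question, generation_config=generation_config))
thread.start()
for new_text in streamer:
    print(new_text, end='', flush=True)
```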
Cloud GPUs
For optimal performance, use cloud GPUs. With 8-bit quantization, two 80 GB GPUs are needed; without it, at least three 80 GB GPUs are required.
License
This project is released under the MIT License and uses the pre-trained Qwen2.5-72B-Instruct, licensed under the Qwen License.