InternVL2_5-4B-MPO

Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that builds on InternVL2.5 and Mixed Preference Optimization (MPO) to improve overall performance. It integrates an incrementally pre-trained InternViT vision encoder with various pre-trained large language models, including InternLM 2.5 and Qwen 2.5, through a randomly initialized MLP projector.
Architecture
InternVL2.5-MPO retains the "ViT-MLP-LLM" architecture paradigm. It applies a pixel unshuffle operation that reduces the number of visual tokens to one quarter, and adopts a dynamic resolution strategy that divides input images into 448×448 pixel tiles. Multi-image and video inputs are also supported, extending the model's multimodal capabilities.
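To illustrate the dynamic resolution strategy, the sketch below splits an image into a grid of 448×448 tiles. It is a simplified stand-in for the actual preprocessing routine, which additionally searches candidate aspect ratios and appends a thumbnail tile; the function name and tile limits here are illustrative.

```python
from PIL import Image

def tile_image(img: Image.Image, tile_size: int = 448, max_tiles: int = 12):
    """Split an image into a grid of tile_size x tile_size crops
    (simplified: no aspect-ratio search, no thumbnail tile)."""
    w, h = img.size
    # Choose a grid that roughly preserves the image's size, capped at max_tiles.
    cols = max(1, min(round(w / tile_size), max_tiles))
    rows = max(1, min(round(h / tile_size), max_tiles // cols))
    img = img.resize((cols * tile_size, rows * tile_size))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size,
                   (c + 1) * tile_size, (r + 1) * tile_size)
            tiles.append(img.crop(box))
    return tiles
```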
Training
The training process involves Mixed Preference Optimization (MPO), which combines preference loss, quality loss, and generation loss. This allows the model to learn the relative preferences between response pairs, the absolute quality of individual responses, and the generation process of preferred responses. A large-scale multimodal reasoning preference dataset (MMPR) with about 3 million samples is used for training.
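For intuition, here is a minimal sketch of how the three MPO terms can be combined, assuming a DPO-style preference loss, a BCO-style quality loss, and a standard SFT generation loss. The weights `w_p`, `w_q`, `w_g` and the reward-shift term `delta` are illustrative, not the published training configuration.

```python
import torch
import torch.nn.functional as F

def mpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             sft_nll, beta=0.1, w_p=0.8, w_q=0.2, w_g=1.0):
    """Combine the three MPO terms for one chosen/rejected response pair.

    The logp_* arguments are summed token log-probabilities under the policy
    or a frozen reference model; sft_nll is the next-token NLL on the chosen
    response. Weights and the reward shift are illustrative.
    """
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Preference loss (DPO-style): relative preference between the pair.
    l_pref = -F.logsigmoid(r_chosen - r_rejected)

    # Quality loss (BCO-style): absolute quality of each individual response,
    # centered by a reward shift (here a simple detached mean of the pair).
    delta = 0.5 * (r_chosen + r_rejected).detach()
    l_quality = (-F.logsigmoid(r_chosen - delta)
                 - F.logsigmoid(-(r_rejected - delta)))

    # Generation loss: standard SFT loss on the preferred response.
    l_gen = sft_nll

    return w_p * l_pref + w_q * l_quality + w_g * l_gen
```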
Guide: Running Locally
Basic Steps
- Model Loading: load the checkpoint in bfloat16 with FlashAttention enabled. Note that AutoTokenizer is also needed for the inference step:

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2_5-4B-MPO"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```
- Inference with Transformers: load and preprocess images or video frames, then use the model to generate responses to text or image inputs, as in the sketch below.
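A minimal single-image example, continuing from the loading step above. For brevity it resizes the image to a single 448×448 tile rather than tiling dynamically; the `chat` method is provided by the model's remote code, and the image path is hypothetical:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

# Continues from the loading step above (`model`, `tokenizer`).
# Single 448x448 tile for brevity; the full pipeline tiles dynamically.
transform = T.Compose([
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=1024, do_sample=True)
question = "<image>\nPlease describe the image briefly."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```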
- Multiple GPUs: distribute model layers across multiple GPUs to handle large-scale inference tasks efficiently, as sketched below.
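One simple way to shard the model, assuming the `accelerate` package is installed, is an automatic device map; an explicit per-layer device map that keeps the vision encoder on the first GPU is a common alternative:

```python
import torch
from transformers import AutoModel

# Let Accelerate shard layers across all visible GPUs automatically.
path = "OpenGVLab/InternVL2_5-4B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map="auto").eval()
```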
- Cloud GPUs: consider cloud GPU services such as AWS, Google Cloud, or Azure to access the computational resources needed for training and inference.
License
This project is released under the MIT License. It incorporates the pre-trained Qwen2.5-3B-Instruct model, which is licensed under the Apache License 2.0.