InternVL2_5-1B-MPO
OpenGVLab

Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that improves performance through Mixed Preference Optimization (MPO). It builds on the InternVL2.5 framework, strengthening its ability to understand and generate multimodal content.
Architecture
InternVL2.5-MPO retains the architecture of its predecessors, following the "ViT-MLP-LLM" paradigm. This involves the use of an incrementally pre-trained InternViT combined with various pre-trained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, through a newly initialized MLP projector. The model architecture also supports multi-image and video data, utilizing a pixel unshuffle operation to reduce visual tokens and a dynamic resolution strategy for processing images.
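The pixel unshuffle operation mentioned above is essentially a space-to-depth fold: each 2x2 neighborhood of ViT features (at the default 0.5 scale) is merged into a single token with four times the channels, reducing the visual token count by 4x. A minimal NumPy sketch, where the shapes and the `scale` default are illustrative rather than the exact InternVL implementation:

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """Fold spatial patches into the channel dimension to cut visual tokens.

    x: (batch, height, width, channels) grid of ViT features.
    With scale=0.5, every 2x2 patch becomes one token with 4x channels.
    """
    b, h, w, c = x.shape
    # Fold along the width: (b, h, w*scale, c/scale)
    x = x.reshape(b, h, int(w * scale), int(c / scale))
    x = x.transpose(0, 2, 1, 3)
    # Fold along the height: (b, w*scale, h*scale, c/scale^2)
    x = x.reshape(b, int(w * scale), int(h * scale), int(c / (scale * scale)))
    x = x.transpose(0, 2, 1, 3)
    return x

feats = np.random.randn(1, 32, 32, 1024)  # 32*32 = 1024 visual tokens
out = pixel_unshuffle(feats)              # (1, 16, 16, 4096) -> 256 tokens
```

The token count drops from `h*w` to `h*w*scale^2`, which is what keeps long multi-image and video contexts affordable for the LLM.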
Training
The model employs a Mixed Preference Optimization (MPO) approach, which combines preference loss, quality loss, and generation loss to enhance learning. The MPO process is designed to teach the model the relative preference between response pairs, the absolute quality of each response, and the generation process for preferred responses. The preference data is sourced from the MMPR dataset, a large-scale multimodal reasoning preference dataset.
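The three MPO terms can be illustrated with scalar (per-response average) log-probabilities. The sketch below assumes a DPO-style preference term, a BCO-style quality term, and a standard language-modeling generation term; the loss weights, `beta`, and function names here are illustrative, not the values or code used in training:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1,
             w_pref: float = 0.8, w_qual: float = 0.2, w_gen: float = 1.0) -> float:
    """Sketch of MPO as a weighted sum of preference, quality, and
    generation losses. Inputs are average log-probs of the chosen and
    rejected responses under the policy and a frozen reference model."""
    # Preference loss (DPO-style): relative preference within the pair.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    pref = -math.log(sigmoid(margin))
    # Quality loss (BCO-style): absolute quality of each response on its own.
    r_chosen = beta * (policy_chosen - ref_chosen)
    r_rejected = beta * (policy_rejected - ref_rejected)
    qual = -math.log(sigmoid(r_chosen)) - math.log(sigmoid(-r_rejected))
    # Generation loss: negative log-likelihood of the preferred response.
    gen = -policy_chosen
    return w_pref * pref + w_qual * qual + w_gen * gen
```

Intuitively, the preference term only sees the gap between the two responses, the quality term pushes each response's implicit reward toward the right sign, and the generation term keeps the model able to produce the preferred response outright.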
Guide: Running Locally
To run InternVL2_5-1B-MPO locally:
- Install Dependencies: Ensure `transformers>=4.37.2` is installed.
- Model Loading: Load the model in 16-bit precision using `torch.bfloat16` and enable `low_cpu_mem_usage`. For multi-GPU setups, distribute model layers across devices to optimize performance.
- Inference: Use the provided code snippets for various inference tasks, such as single- and multi-image conversations or video processing.
- Fine-Tuning and Deployment: Explore repositories such as SWIFT and XTuner for fine-tuning options, and use `LMDeploy` for model deployment via an easy inference pipeline.
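The loading step above might look like the following minimal sketch. The `load_model` helper is hypothetical; the keyword arguments follow the standard `transformers` `from_pretrained` API, and `trust_remote_code=True` is required because the model ships custom modeling code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL2_5-1B-MPO"

def load_model(model_id: str = MODEL_ID):
    """Load the model in 16-bit precision on a CUDA GPU (sketch)."""
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # 16-bit weights
        low_cpu_mem_usage=True,       # avoid a full fp32 copy in host RAM
        trust_remote_code=True,       # model uses custom modeling code
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, use_fast=False
    )
    return model, tokenizer
```

For multi-GPU setups, passing a `device_map` to `from_pretrained` (instead of calling `.cuda()`) lets `transformers` shard the layers across devices.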
Cloud GPUs
For optimal performance, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure, which provide scalable and cost-effective resources.
License
This project is released under the MIT License. It incorporates components from the Qwen2.5-0.5B-Instruct model, licensed under the Apache License 2.0.