InternVL2_5-1B-MPO
OpenGVLab

Introduction
InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series that improves performance through Mixed Preference Optimization (MPO). It builds on the InternVL2.5 framework, strengthening its ability to understand and generate multimodal content.
Architecture
InternVL2.5-MPO retains the architecture of its predecessors, following the "ViT-MLP-LLM" paradigm. This involves the use of an incrementally pre-trained InternViT combined with various pre-trained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, through a newly initialized MLP projector. The model architecture also supports multi-image and video data, utilizing a pixel unshuffle operation to reduce visual tokens and a dynamic resolution strategy for processing images.
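The pixel unshuffle operation mentioned above is essentially a space-to-depth fold: each 2x2 neighborhood of ViT features (at the default 0.5 scale) is merged into a single token with four times the channels, reducing the visual token count by 4x. A minimal NumPy sketch, where the shapes and the `scale` default are illustrative rather than the exact InternVL implementation:

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """Fold spatial patches into the channel dimension to cut visual tokens.

    x: (batch, height, width, channels) grid of ViT features.
    With scale=0.5, every 2x2 patch becomes one token with 4x channels.
    """
    b, h, w, c = x.shape
    # Fold along the width: (b, h, w*scale, c/scale)
    x = x.reshape(b, h, int(w * scale), int(c / scale))
    x = x.transpose(0, 2, 1, 3)
    # Fold along the height: (b, w*scale, h*scale, c/scale^2)
    x = x.reshape(b, int(w * scale), int(h * scale), int(c / (scale * scale)))
    x = x.transpose(0, 2, 1, 3)
    return x

feats = np.random.randn(1, 32, 32, 1024)  # 32*32 = 1024 visual tokens
out = pixel_unshuffle(feats)              # (1, 16, 16, 4096) -> 256 tokens
```

The token count drops from `h*w` to `h*w*scale^2`, which is what keeps long multi-image and video contexts affordable for the LLM.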
Training
The model employs a Mixed Preference Optimization (MPO) approach, which combines preference loss, quality loss, and generation loss to enhance learning. The MPO process is designed to teach the model the relative preference between response pairs, the absolute quality of each response, and the generation process for preferred responses. The preference data is sourced from the MMPR dataset, a large-scale multimodal reasoning preference dataset.
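The three MPO terms can be illustrated with scalar (per-response average) log-probabilities. The sketch below assumes a DPO-style preference term, a BCO-style quality term, and a standard language-modeling generation term; the loss weights, `beta`, and function names here are illustrative, not the values or code used in training:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1,
             w_pref: float = 0.8, w_qual: float = 0.2, w_gen: float = 1.0) -> float:
    """Sketch of MPO as a weighted sum of preference, quality, and
    generation losses. Inputs are average log-probs of the chosen and
    rejected responses under the policy and a frozen reference model."""
    # Preference loss (DPO-style): relative preference within the pair.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    pref = -math.log(sigmoid(margin))
    # Quality loss (BCO-style): absolute quality of each response on its own.
    r_chosen = beta * (policy_chosen - ref_chosen)
    r_rejected = beta * (policy_rejected - ref_rejected)
    qual = -math.log(sigmoid(r_chosen)) - math.log(sigmoid(-r_rejected))
    # Generation loss: negative log-likelihood of the preferred response.
    gen = -policy_chosen
    return w_pref * pref + w_qual * qual + w_gen * gen
```

Intuitively, the preference term only sees the gap between the two responses, the quality term pushes each response's implicit reward toward the right sign, and the generation term keeps the model able to produce the preferred response outright.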
Guide: Running Locally
To run InternVL2_5-1B-MPO locally:
- Install Dependencies: Ensure `transformers>=4.37.2` is installed.
- Model Loading: Load the model in 16-bit precision using `torch.bfloat16` and enable `low_cpu_mem_usage`. For multi-GPU setups, distribute model layers across devices to optimize performance.
- Inference: Use the provided code snippets for various inference tasks, such as single- and multi-image conversations or video processing.
- Fine-Tuning and Deployment: Explore repositories such as SWIFT and XTuner for fine-tuning options, and use `LMDeploy` for model deployment via an easy inference pipeline.
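The loading step above might look like the following minimal sketch. The `load_model` helper is hypothetical; the keyword arguments follow the standard `transformers` `from_pretrained` API, and `trust_remote_code=True` is required because the model ships custom modeling code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL2_5-1B-MPO"

def load_model(model_id: str = MODEL_ID):
    """Load the model in 16-bit precision on a CUDA GPU (sketch)."""
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # 16-bit weights
        low_cpu_mem_usage=True,       # avoid a full fp32 copy in host RAM
        trust_remote_code=True,       # model uses custom modeling code
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, use_fast=False
    )
    return model, tokenizer
```

For multi-GPU setups, passing a `device_map` to `from_pretrained` (instead of calling `.cuda()`) lets `transformers` shard the layers across devices.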
Cloud GPUs
For optimal performance, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure, which provide scalable and cost-effective resources.
License
This project is released under the MIT License. It incorporates components from the Qwen2.5-0.5B-Instruct model, licensed under the Apache License 2.0.