InternVL2_5-78B-MPO-AWQ

OpenGVLab

Introduction

InternVL2.5-MPO is an advanced multimodal large language model (MLLM) series developed by OpenGVLab, showcasing enhanced performance through Mixed Preference Optimization. This model builds upon previous versions like InternVL2.5, integrating multimodal capabilities for improved reasoning and language tasks.

Architecture

InternVL2.5-MPO retains the "ViT-MLP-LLM" paradigm of its predecessors, combining an incrementally pre-trained vision transformer (InternViT) with pre-trained large language models (LLMs) such as InternLM 2.5 and Qwen 2.5 through an MLP projector. A notable architectural feature is the pixel unshuffle operation, which reduces the number of visual tokens to one quarter; the model also supports multi-image and video inputs.
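
As a concrete illustration, the sketch below shows a 0.5x pixel unshuffle in PyTorch: it folds each 2x2 neighborhood of ViT patch tokens into the channel dimension, cutting the visual token count to one quarter. This is an illustrative sketch with placeholder shapes, not the model's actual implementation.

    import torch

    def pixel_unshuffle(x, scale=0.5):
        # x: (batch, height, width, channels) grid of ViT patch tokens
        n, h, w, c = x.shape
        # fold each 2x2 neighborhood of tokens into the channel dimension,
        # reducing the number of visual tokens to one quarter
        x = x.view(n, h, int(w * scale), int(c / scale))
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(n, int(w * scale), int(h * scale), int(c / (scale * scale)))
        return x.permute(0, 2, 1, 3).contiguous()

    tokens = torch.randn(1, 32, 32, 1024)   # a 32 x 32 grid of patch tokens
    merged = pixel_unshuffle(tokens)
    print(merged.shape)                      # torch.Size([1, 16, 16, 4096]): 256 tokens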

Training

The training process uses Mixed Preference Optimization (MPO), which teaches the model both the relative preference between pairs of responses and the absolute quality of individual responses. The objective combines a preference loss (DPO), a quality loss (BCO), and a generation loss (SFT), and the model is trained on the large-scale MMPR preference dataset to enhance multimodal reasoning.
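
To make the combination concrete, the following is a minimal, illustrative sketch of how the three terms might be weighted and summed. The weights, beta, and the BCO reward shift delta are placeholders, not the values used to train InternVL2.5-MPO.

    import torch.nn.functional as F

    def mpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps,
                 sft_token_logps, delta, beta=0.1, w_pref=1.0, w_qual=1.0, w_gen=1.0):
        # Preference loss (DPO-style): relative preference between a response pair
        pref_logits = beta * ((chosen_logps - ref_chosen_logps)
                              - (rejected_logps - ref_rejected_logps))
        loss_pref = -F.logsigmoid(pref_logits).mean()

        # Quality loss (BCO-style): absolute quality of each response, scored
        # independently against a reward shift delta (a running mean in BCO)
        reward_chosen = beta * (chosen_logps - ref_chosen_logps)
        reward_rejected = beta * (rejected_logps - ref_rejected_logps)
        loss_qual = (-F.logsigmoid(reward_chosen - delta)
                     - F.logsigmoid(-(reward_rejected - delta))).mean()

        # Generation loss (SFT-style): next-token log-likelihood of the chosen response
        loss_gen = -sft_token_logps.mean()

        return w_pref * loss_pref + w_qual * loss_qual + w_gen * loss_gen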

Guide: Running Locally

  1. Install LMDeploy: Ensure LMDeploy version 0.6.4 or higher is installed.

    pip install "lmdeploy>=0.6.4"
    
  2. Set Up the Pipeline: Use the pipeline functionality from LMDeploy to load and infer models, with configurations optimized for performance.

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image
    
    model = 'OpenGVLab/InternVL2_5-78B-MPO-AWQ'
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
    
  3. Run Inference: Load images and run inference to obtain descriptive outputs.

    image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
    response = pipe(('describe this image', image))
    print(response.text)
    
  4. Multi-Image and Batch Inference: For more complex tasks, pass multiple images or a batch of prompts as lists, as in the sketch below.
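
     A minimal sketch following the single-image pattern above; the second image URL is a placeholder, and IMAGE_TOKEN is the image placeholder exported by lmdeploy.vl.constants.

    from lmdeploy import pipeline, TurbomindEngineConfig
    from lmdeploy.vl import load_image
    from lmdeploy.vl.constants import IMAGE_TOKEN

    model = 'OpenGVLab/InternVL2_5-78B-MPO-AWQ'
    pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))

    image_urls = [
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg',
    ]
    images = [load_image(url) for url in image_urls]

    # Multi-image inference: one prompt that references several images
    response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
    print(response.text)

    # Batch inference: a list of (prompt, image) pairs processed together
    responses = pipe([('describe this image', img) for img in images])
    print([r.text for r in responses])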

  5. Cloud GPUs: For optimal performance, consider using cloud GPUs from providers like AWS, GCP, or Azure to manage the computational load efficiently.

License

InternVL2.5-MPO is released under the MIT License. Components such as the pre-trained Qwen2.5-72B-Instruct are licensed under the Apache License 2.0.
