InternVL2-8B-MPO

OpenGVLab

Introduction

Existing open-source multimodal large language models (MLLMs) typically undergo pre-training and supervised fine-tuning, but they remain vulnerable to distribution shifts that limit their multimodal reasoning, especially Chain-of-Thought (CoT) reasoning. To address this, a preference optimization (PO) process is introduced: an automated preference data construction pipeline produces the MMPR dataset, and a method called Mixed Preference Optimization (MPO) integrates PO into MLLM training, significantly improving multimodal CoT performance. InternVL2-8B-MPO achieves notable accuracy improvements over its predecessor and performance comparable to much larger models.
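One part of the automated pipeline pairs sampled rollouts into preference data: for instructions with a verifiable ground truth, correct rollouts become chosen responses and incorrect ones become rejected responses. A minimal sketch of that pairing step, with the helper name `build_preference_pairs` assumed here for illustration:

```python
def build_preference_pairs(candidates, is_correct):
    """Pair each correct rollout (chosen) with each incorrect rollout (rejected).

    candidates: list of sampled responses for one instruction.
    is_correct: parallel list of booleans from answer matching.
    Returns a list of (chosen, rejected) preference pairs.
    """
    chosen = [c for c, ok in zip(candidates, is_correct) if ok]
    rejected = [c for c, ok in zip(candidates, is_correct) if not ok]
    # Cartesian product: every correct response is preferred over every incorrect one.
    return [(c, r) for c in chosen for r in rejected]
```

For open-ended instructions without a ground truth, the pipeline instead derives rejected responses by other means (e.g., completions generated without the image), which this sketch does not cover.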

Architecture

InternVL2-8B-MPO is based on the InternVL2-8B model and fine-tuned on the MMPR dataset. It demonstrates enhanced multimodal reasoning abilities and reduced hallucinations compared to InternVL2-8B.

Training

The model is trained using a large-scale multimodal reasoning preference dataset (MMPR) and employs Mixed Preference Optimization (MPO) to improve performance on multimodal reasoning tasks.
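MPO combines a preference loss over chosen/rejected pairs (DPO-style), a quality loss on each response individually (BCO-style, simplified below by omitting the reward-shift term), and a generation loss on the chosen response (SFT-style). A minimal numeric sketch, assuming log-probabilities under the policy and a frozen reference model; the weights and the name `mpo_loss` are illustrative, not the report's exact values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             beta=0.1, w_pref=0.8, w_qual=0.4, w_gen=0.2):
    # Log-ratios of policy vs. reference model for each response.
    r_c = logp_chosen - ref_chosen
    r_r = logp_rejected - ref_rejected
    # Preference loss (DPO-style): chosen should beat rejected by the implicit reward.
    pref = -math.log(sigmoid(beta * (r_c - r_r)))
    # Quality loss (BCO-style, simplified): score each response on its own.
    qual = (-math.log(sigmoid(beta * r_c)) - math.log(sigmoid(-beta * r_r))) / 2
    # Generation loss (SFT-style): negative log-likelihood of the chosen response.
    gen = -logp_chosen
    return w_pref * pref + w_qual * qual + w_gen * gen
```

The weighted sum lets the model learn relative preferences, absolute response quality, and the generation format in a single objective.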

Guide: Running Locally

Basic Steps

  1. Install Requirements: Ensure transformers==4.37.2 is installed.
  2. Model Loading: Use the following code to load the model:
    import torch
    from transformers import AutoTokenizer, AutoModel

    path = "OpenGVLab/InternVL2-8B-MPO"
    # Load in bfloat16 with FlashAttention; remote code is required
    # for the InternVL architecture.
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    
  3. Inference: Preprocess image inputs with the model card's provided transforms, then pass the resulting pixel values together with the tokenized text to the model to generate a response.
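The steps above can be sketched end to end. The snippet below assumes the `model.chat` interface from the InternVL model card; the function name `chat_with_model` and the generation settings are illustrative, and an actual run requires a CUDA GPU with roughly 16 GB of memory in bfloat16:

```python
# Illustrative generation settings.
generation_config = dict(max_new_tokens=1024, do_sample=False)

def chat_with_model(question):
    # Heavy calls kept inside the function; downloads the 8B checkpoint on first use.
    import torch
    from transformers import AutoModel, AutoTokenizer

    path = "OpenGVLab/InternVL2-8B-MPO"
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True).eval().cuda()
    # Text-only turn: pixel_values=None. For images, build pixel values with the
    # load_image helper from the model card and pass them in place of None.
    response = model.chat(tokenizer, None, question, generation_config)
    return response
```

For multimodal inputs, the model card's preprocessing tiles and normalizes the image before producing the pixel values passed to `model.chat`.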

Cloud GPUs

For efficient performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.

License

This project is released under the MIT license. InternLM2 is licensed under the Apache-2.0 license.
