InternVL2-8B-MPO
OpenGVLab

Introduction
Existing open-source multimodal large language models (MLLMs) often undergo pre-training and supervised fine-tuning but face challenges with distribution shifts, which limit their multimodal reasoning capabilities, especially in Chain-of-Thought (CoT) reasoning. To enhance these capabilities, a preference optimization (PO) process is introduced. This involves an automated preference data construction pipeline to create the MMPR dataset and integrates PO with MLLMs through a method called Mixed Preference Optimization (MPO), significantly improving multimodal CoT performance. InternVL2-8B-MPO achieves notable accuracy improvements over its predecessor and comparable performance to larger models.
Architecture
InternVL2-8B-MPO is based on the InternVL2-8B model and fine-tuned using the MMPR dataset. It demonstrates enhanced multimodal reasoning ability and reduced hallucinations compared to InternVL2-8B.
Training
The model is trained using a large-scale multimodal reasoning preference dataset (MMPR) and employs Mixed Preference Optimization (MPO) to improve performance on multimodal reasoning tasks.
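As described above, MPO mixes several training signals rather than relying on a single preference loss. The sketch below, in plain PyTorch, illustrates one plausible way such a mixed objective can be assembled: a DPO-style preference loss over chosen/rejected response pairs, a quality loss that scores each response individually against a reward baseline, and a standard generation (SFT) loss. The function name, the baseline choice (batch mean), and all weights are illustrative assumptions, not the released training configuration.

```python
import torch
import torch.nn.functional as F

def mpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             sft_nll, beta=0.1,
             w_pref=0.8, w_qual=0.2, w_gen=1.0):
    """Illustrative sketch of a mixed preference objective.

    Combines a DPO-style pairwise preference loss, a pointwise
    quality loss, and a generation (SFT) loss. Weights and the
    reward baseline are assumptions for illustration only.
    """
    # DPO-style rewards: log-prob ratios against a frozen reference model.
    chosen_rewards = beta * (chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (rejected_logps - ref_rejected_logps)
    # Pairwise preference loss on the reward margin.
    pref_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Pointwise quality loss: judge each response independently
    # against a baseline (here simply the detached batch mean reward).
    delta = torch.cat([chosen_rewards, rejected_rewards]).mean().detach()
    qual_loss = (-F.logsigmoid(chosen_rewards - delta)
                 - F.logsigmoid(-(rejected_rewards - delta))).mean()

    # Generation loss: negative log-likelihood of the chosen response.
    gen_loss = sft_nll.mean()

    return w_pref * pref_loss + w_qual * qual_loss + w_gen * gen_loss
```

In practice the three terms pull in complementary directions: the pairwise term teaches relative ranking, the pointwise term anchors absolute response quality, and the generation term prevents the policy from drifting away from fluent outputs.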
Guide: Running Locally
Basic Steps
- Install Requirements: Ensure transformers==4.37.2 is installed.
- Model Loading: Use the following code to load the model:
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2-8B-MPO"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
- Inference: Use the provided image transformations and the tokenizer to prepare image and text inputs, then run the model.
Cloud GPUs
For efficient performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
This project is released under the MIT license. InternLM2 is licensed under the Apache-2.0 license.