Ovis1.6-Gemma2-9B
Introduction
We are excited to announce the open-sourcing of Ovis1.6, our latest multimodal large language model. Ovis is a novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
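The card itself does not spell out the alignment mechanism, but the Ovis paper describes a learnable visual embedding table: each visual patch is mapped to a probability distribution over a visual vocabulary, and its embedding is the probability-weighted sum of the table's rows, mirroring the one-hot lookup used for text tokens. A conceptual sketch, with illustrative names and sizes that are not the repository's actual modules:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Probability-weighted lookup over a learnable visual vocabulary
    (conceptual sketch of Ovis's structural-alignment idea)."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, token_probs: torch.Tensor) -> torch.Tensor:
        # token_probs: (num_patches, vocab_size); each row is a distribution
        # over the visual vocabulary (e.g. a softmax over patch features).
        # Text tokens use a one-hot version of this same lookup, which is
        # what aligning the two embedding spaces structurally refers to.
        return token_probs @ self.table.weight  # -> (num_patches, hidden_dim)

# Illustrative usage with made-up dimensions:
vet = VisualEmbeddingTable(vocab_size=8192, hidden_dim=3584)
probs = torch.softmax(torch.randn(256, 8192), dim=-1)
embeddings = vet(probs)  # (256, 3584)
```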
Architecture
Ovis1.6 builds upon Ovis1.5 by enhancing high-resolution image processing. It is trained on a larger, more diverse, and higher-quality dataset, and its training process is refined with DPO after instruction tuning. With just 10B parameters, Ovis1.6-Gemma2-9B leads the OpenCompass benchmark among open-source MLLMs with fewer than 30B parameters.
Training
Ovis1.6 combines high-resolution image processing with a larger, more diverse, and higher-quality training dataset. After instruction tuning, the model is further refined with DPO (Direct Preference Optimization) to improve the quality of its responses.
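The card does not publish the training code, so purely as an illustration of what a DPO stage after instruction tuning looks like, here is a sketch using Hugging Face's trl library (trl >= 0.12 API). The checkpoint name, dataset file, hyperparameters, and the use of trl itself are assumptions, not the authors' actual recipe.

```python
# Illustrative DPO stage (assumed setup; not the authors' actual recipe).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-instruction-tuned-checkpoint"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects preference pairs: "prompt", "chosen", "rejected" columns.
dataset = load_dataset("json", data_files="preference_pairs.json")["train"]  # hypothetical file

config = DPOConfig(output_dir="ovis-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```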
Guide: Running Locally
- Install Dependencies:

```bash
pip install torch==2.2.0 transformers==4.44.2 numpy==1.24.3 pillow==10.3.0
```
- Load Model:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the multimodal model in bfloat16 and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=8192,
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
```
- Run Inference:
  - Enter an image path and a prompt, format the conversation, and generate the output, as in the sketch below.
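The full generation snippet lives in the model card; the following is a minimal sketch of that flow. It assumes the `preprocess_inputs` helper exposed by the repository's remote code returns the formatted prompt, token ids, and pixel values, and the generation parameters are illustrative defaults.

```python
# Minimal single-image inference sketch (assumes `model`, `text_tokenizer`,
# and `visual_tokenizer` from the "Load Model" step above).
image = Image.open("example.jpg")          # your image path
query = "<image>\nDescribe this image."    # <image> marks where the image goes

# `preprocess_inputs` is provided by the Ovis remote code (assumption based
# on the model card); it returns prompt text, token ids, and pixel values.
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```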
- Batch Inference:
  - Prepare a list of (image, prompt) pairs, preprocess and pad them into a single batch, and generate all outputs in one call, as in the sketch below.
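The model card ships its own batch-inference code; as a rough sketch of the idea, each sample is preprocessed individually, the token ids are left-padded to a common length, and the batch is generated in one call. The padding scheme here is an assumption, not the card's exact code.

```python
# Rough batch-inference sketch (assumed padding scheme; see the model card
# for the authors' exact batch code).
batch = [("cat.jpg", "What is in this image?"), ("dog.jpg", "Describe the scene.")]

ids_list, pixels_list = [], []
for path, text in batch:
    image = Image.open(path)
    _, input_ids, pixel_values = model.preprocess_inputs(f"<image>\n{text}", [image])
    ids_list.append(input_ids)
    pixels_list.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

# Left-pad token ids to the longest sequence so generation starts aligned.
pad_id = text_tokenizer.pad_token_id
max_len = max(ids.shape[-1] for ids in ids_list)
padded = torch.full((len(ids_list), max_len), pad_id, dtype=torch.long)
for i, ids in enumerate(ids_list):
    padded[i, -ids.shape[-1]:] = ids
input_ids = padded.to(model.device)
attention_mask = torch.ne(input_ids, pad_id)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixels_list,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=pad_id,
        use_cache=True,
    )
for i, out in enumerate(output_ids):
    print(f"Sample {i}: {text_tokenizer.decode(out, skip_special_tokens=True)}")
```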
Suggestion: For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
This project is licensed under the Apache License, Version 2.0. For more details, refer to the license document.