Ovis1.6-Gemma2-9B
Introduction
We are excited to announce the open-sourcing of Ovis1.6, our latest multimodal large language model. Ovis is a novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
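The card itself does not spell out the alignment mechanism, but the Ovis paper describes a learnable visual embedding table: each visual patch is mapped to a probability distribution over a visual vocabulary, and its embedding is the probability-weighted sum of the table's rows, mirroring the one-hot lookup used for text tokens. A conceptual sketch, with illustrative names and sizes that are not the repository's actual modules:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Probability-weighted lookup over a learnable visual vocabulary
    (conceptual sketch of Ovis's structural-alignment idea)."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, token_probs: torch.Tensor) -> torch.Tensor:
        # token_probs: (num_patches, vocab_size); each row is a distribution
        # over the visual vocabulary (e.g. a softmax over patch features).
        # Text tokens use a one-hot version of this same lookup, which is
        # what aligning the two embedding spaces structurally refers to.
        return token_probs @ self.table.weight  # -> (num_patches, hidden_dim)

# Illustrative usage with made-up dimensions:
vet = VisualEmbeddingTable(vocab_size=8192, hidden_dim=3584)
probs = torch.softmax(torch.randn(256, 8192), dim=-1)
embeddings = vet(probs)  # (256, 3584)
```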
Architecture
Ovis1.6 builds upon Ovis1.5 by enhancing high-resolution image processing. It is trained on a larger, more diverse, and higher-quality dataset, and its training process is refined with DPO after instruction tuning. With just 10B parameters, Ovis1.6-Gemma2-9B leads the OpenCompass benchmark among open-source MLLMs with fewer than 30B parameters.
Training
Ovis1.6 combines high-resolution image processing with a larger, more diverse, and higher-quality training dataset. After instruction tuning, the model is further refined with DPO (Direct Preference Optimization) to improve the quality of its responses.
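The card does not publish the training code, so purely as an illustration of what a DPO stage after instruction tuning looks like, here is a sketch using Hugging Face's trl library (trl >= 0.12 API). The checkpoint name, dataset file, hyperparameters, and the use of trl itself are assumptions, not the authors' actual recipe.

```python
# Illustrative DPO stage (assumed setup; not the authors' actual recipe).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-instruction-tuned-checkpoint"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects preference pairs: "prompt", "chosen", "rejected" columns.
dataset = load_dataset("json", data_files="preference_pairs.json")["train"]  # hypothetical file

config = DPOConfig(output_dir="ovis-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```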
Guide: Running Locally
- Install Dependencies:

```bash
pip install torch==2.2.0 transformers==4.44.2 numpy==1.24.3 pillow==10.3.0
```
- Load Model:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the multimodal model in bfloat16 and move it to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=8192,
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
```
- Run Inference:
  - Enter an image path and a prompt, format the conversation, and generate the output, as in the sketch below.
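The full generation snippet lives in the model card; the following is a minimal sketch of that flow. It assumes the `preprocess_inputs` helper exposed by the repository's remote code returns the formatted prompt, token ids, and pixel values, and the generation parameters are illustrative defaults.

```python
# Minimal single-image inference sketch (assumes `model`, `text_tokenizer`,
# and `visual_tokenizer` from the "Load Model" step above).
image = Image.open("example.jpg")          # your image path
query = "<image>\nDescribe this image."    # <image> marks where the image goes

# `preprocess_inputs` is provided by the Ovis remote code (assumption based
# on the model card); it returns prompt text, token ids, and pixel values.
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```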
- Batch Inference:
  - Prepare a list of (image, prompt) pairs, preprocess and pad them into a single batch, and generate all outputs in one call, as in the sketch below.
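The model card ships its own batch-inference code; as a rough sketch of the idea, each sample is preprocessed individually, the token ids are left-padded to a common length, and the batch is generated in one call. The padding scheme here is an assumption, not the card's exact code.

```python
# Rough batch-inference sketch (assumed padding scheme; see the model card
# for the authors' exact batch code).
batch = [("cat.jpg", "What is in this image?"), ("dog.jpg", "Describe the scene.")]

ids_list, pixels_list = [], []
for path, text in batch:
    image = Image.open(path)
    _, input_ids, pixel_values = model.preprocess_inputs(f"<image>\n{text}", [image])
    ids_list.append(input_ids)
    pixels_list.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

# Left-pad token ids to the longest sequence so generation starts aligned.
pad_id = text_tokenizer.pad_token_id
max_len = max(ids.shape[-1] for ids in ids_list)
padded = torch.full((len(ids_list), max_len), pad_id, dtype=torch.long)
for i, ids in enumerate(ids_list):
    padded[i, -ids.shape[-1]:] = ids
input_ids = padded.to(model.device)
attention_mask = torch.ne(input_ids, pad_id)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixels_list,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=pad_id,
        use_cache=True,
    )
for i, out in enumerate(output_ids):
    print(f"Sample {i}: {text_tokenizer.decode(out, skip_special_tokens=True)}")
```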
Suggestion: For optimal performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
This project is licensed under the Apache License, Version 2.0. For more details, refer to the license document.