Ovis1.6-Gemma2-27B

AIDC-AI

Introduction

Ovis1.6-Gemma2-27B is an advanced multimodal large language model (MLLM) developed by AIDC-AI. Built on the Ovis1.6 framework, it scales up model capacity and aligns visual and textual embeddings, delivering stronger overall performance, advanced image processing, refined chain-of-thought reasoning, and improved document comprehension.

Architecture

Ovis1.6-Gemma2-27B enhances high-resolution image processing and is trained on a larger, more diverse dataset. Training incorporates Direct Preference Optimization (DPO) after instruction tuning, balancing the structural alignment of visual and textual embeddings with end-task performance.

Training

The model is trained on the AIDC-AI/Ovis-dataset using the transformers library. Ovis1.6-Gemma2-27B scores at the top tier of benchmarks such as OpenCompass, demonstrating high accuracy on complex image-text tasks.

Guide: Running Locally

To run Ovis1.6-Gemma2-27B locally, follow these steps:

  1. Install Required Libraries:

    pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
    
  2. Load the Model:

    import torch
    from transformers import AutoModelForCausalLM

    # trust_remote_code pulls in the Ovis modeling code; bfloat16 keeps the 27B weights at roughly 54 GB
    model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis1.6-Gemma2-27B",
                                                 torch_dtype=torch.bfloat16,
                                                 multimodal_max_length=8192,
                                                 trust_remote_code=True).cuda()
    
  3. Prepare Input and Generate Output: Preprocess an image and a text prompt, then generate a response; a minimal sketch is provided after this list.

  4. FlashAttention Support (Optional): To enhance performance, install flash-attn:

    pip install flash-attn --no-build-isolation
    

    Then pass llm_attn_implementation='flash_attention_2' to from_pretrained when loading the model.

  5. Batch Inference: Process multiple image-text inputs in a single generate call; a rough batch sketch follows the cloud GPU note below.
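
The usage code referenced in step 3 is not reproduced above, so the following is a minimal sketch of single-image inference. It assumes the helper methods exposed by the model's remote code (get_text_tokenizer, get_visual_tokenizer, preprocess_inputs, the max_partition tiling argument) and uses a placeholder image path and prompt; consult the official model card for the authoritative usage snippet.

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM

    # load the model as in step 2
    # optional: add llm_attn_implementation='flash_attention_2' if flash-attn is installed (step 4)
    model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis1.6-Gemma2-27B",
                                                 torch_dtype=torch.bfloat16,
                                                 multimodal_max_length=8192,
                                                 trust_remote_code=True).cuda()
    text_tokenizer = model.get_text_tokenizer()
    visual_tokenizer = model.get_visual_tokenizer()

    # build a single image + text query ("example.jpg" is a placeholder path)
    image = Image.open("example.jpg")
    query = "<image>\nDescribe this image."
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

    # add the batch dimension and move everything to the model's devices
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

    # generate and decode the answer
    with torch.inference_mode():
        output_ids = model.generate(input_ids,
                                    pixel_values=pixel_values,
                                    attention_mask=attention_mask,
                                    max_new_tokens=1024,
                                    do_sample=False,
                                    eos_token_id=model.generation_config.eos_token_id,
                                    pad_token_id=text_tokenizer.pad_token_id,
                                    use_cache=True)[0]
    print(text_tokenizer.decode(output_ids, skip_special_tokens=True))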

Suggested Cloud GPUs: Cloud services such as AWS EC2 or Google Cloud Platform with high-memory NVIDIA GPUs are recommended for optimal performance; the loading code above targets CUDA, and in bfloat16 the 27B parameters alone occupy roughly 54 GB of GPU memory.
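
Step 5 refers to a batch inference example that is not reproduced above. The rough sketch below reuses the model and tokenizers from the previous snippet, left-pads the per-sample token IDs and attention masks, and passes pixel values as a per-sample list; this padding convention is an assumption, so compare it against the batch example on the official model card.

    import torch
    from PIL import Image

    # reuse model, text_tokenizer, and visual_tokenizer from the single-image sketch;
    # paths and prompts below are placeholders
    batch = [
        ("image1.jpg", "Describe this image."),
        ("image2.jpg", "What text appears in the image?"),
    ]

    ids_list, mask_list, pix_list = [], [], []
    for path, text in batch:
        query = f"<image>\n{text}"
        _, input_ids, pixel_values = model.preprocess_inputs(query, [Image.open(path)], max_partition=9)
        ids_list.append(input_ids.to(model.device))
        mask_list.append(torch.ne(input_ids, text_tokenizer.pad_token_id).to(model.device))
        pix_list.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

    # left-pad sequences to a common length by flipping before and after pad_sequence
    def left_pad(seqs, value):
        padded = torch.nn.utils.rnn.pad_sequence([s.flip(dims=[0]) for s in seqs],
                                                 batch_first=True, padding_value=value)
        return padded.flip(dims=[1])

    input_ids = left_pad(ids_list, text_tokenizer.pad_token_id)
    attention_mask = left_pad(mask_list, False)

    with torch.inference_mode():
        output_ids = model.generate(input_ids,
                                    pixel_values=pix_list,
                                    attention_mask=attention_mask,
                                    max_new_tokens=1024,
                                    do_sample=False,
                                    eos_token_id=model.generation_config.eos_token_id,
                                    pad_token_id=text_tokenizer.pad_token_id,
                                    use_cache=True)
    for i in range(len(batch)):
        print(text_tokenizer.decode(output_ids[i], skip_special_tokens=True))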

License

This project is licensed under the Apache License, Version 2.0. Usage is subject to Gemma's use restrictions, with prohibited uses outlined in the Gemma Prohibited Use Policy. Compliance with applicable laws and regulations is required, and Google reserves the right to restrict usage if terms are violated.
