MiniCPM-Llama3-V 2.5
Introduction
MiniCPM-Llama3-V 2.5 is a cutting-edge, multilingual, multimodal language model that operates at a GPT-4V level. It is designed for diverse applications, including OCR and multilingual processing, and can run efficiently on mobile devices. The model is built on the SigLip-400M and Llama3-8B-Instruct architectures with 8 billion parameters.
Architecture
MiniCPM-Llama3-V 2.5 combines the strengths of SigLip-400M and Llama3-8B-Instruct. It supports over 30 languages and offers advanced OCR capabilities, processing images with up to 1.8 million pixels. It includes optimizations for efficient deployment on edge devices, utilizing model quantization and NPU acceleration.
Training
The model is trained using extensive datasets, including the openbmb/RLAIF-V-Dataset, and employs the latest RLAIF-V method for improved trustworthiness and reduced hallucination rates. It supports LoRA fine-tuning with minimal GPU resources, enhancing its adaptability for various applications.
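As a rough sketch of what LoRA fine-tuning can look like here (not the official recipe), an adapter can be attached with the peft library. The target module names below are assumptions for Llama3-style attention layers and may need adjusting to the actual MiniCPM-Llama3-V 2.5 code:

```python
# Illustrative LoRA sketch using the peft library. The target_modules names
# are assumptions for Llama3-style attention projections and may differ in
# the model's remote code.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,            # low-rank dimension of the adapter matrices
    lora_alpha=32,   # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed Llama3 attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained
```

Because only the low-rank adapter matrices receive gradients, this keeps GPU memory requirements far below full fine-tuning of all 8 billion parameters.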
Guide: Running Locally
Requirements:
- Python 3.10
- Install the following packages: Pillow==10.1.0, torch==2.1.2, torchvision==0.16.2, transformers==4.40.0, sentencepiece==0.1.99
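For example, with pip:

```bash
pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99
```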
Setup:
- Use the transformers library for model inference on NVIDIA GPUs.
- An example script for running inference with image input and text output is shown below.
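A minimal sketch of such a script, assuming the Hugging Face repo id openbmb/MiniCPM-Llama3-V-2_5 and the chat() interface loaded via trust_remote_code; the exact method signature may differ between releases:

```python
# Minimal inference sketch for an NVIDIA GPU. Assumes the model's
# remote code exposes a chat() method, as described on the model card;
# the exact signature may vary.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda").eval()

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
msgs = [{"role": "user", "content": "What is in this image?"}]

# Image input in, text output back.
answer = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(answer)
```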
Deployment Options:
- Run with llama.cpp for CPU inference.
- Use the INT4 quantized version for lower GPU memory usage (see the sketch after this list).
- For cloud GPUs, consider services like AWS or Google Cloud with NVIDIA V100 or similar GPUs.
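For the INT4 option, loading the quantized checkpoint mirrors the full-precision case. A sketch, assuming the quantized weights are published as openbmb/MiniCPM-Llama3-V-2_5-int4:

```python
# Sketch: load the INT4 quantized variant to reduce GPU memory usage.
# Assumes the quantized weights live at openbmb/MiniCPM-Llama3-V-2_5-int4;
# no torch_dtype is passed because the checkpoint already stores
# quantized weights.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5-int4", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5-int4", trust_remote_code=True
)
model.eval()
```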
License
The code is released under the Apache-2.0 License. The model weights are free for academic research and commercial use after registration. Users must adhere to the MiniCPM Model License terms. The developers are not liable for any misuse of the model.