OmniVLM-968M by NexaAIDev
Introduction
OmniVLM is a compact, sub-billion parameter multimodal model designed to process both visual and text inputs efficiently, making it ideal for edge devices. The model is built to handle tasks such as Visual Question Answering (VQA) and Image Captioning. It features a significant reduction in image tokens, improving latency and computational efficiency, and employs Direct Preference Optimization (DPO) training to minimize hallucinations.
Architecture
OmniVLM's architecture comprises three primary components:
- Base Language Model: Utilizes Qwen2.5-0.5B-Instruct for processing text inputs.
- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to produce image embeddings.
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space, reducing the image token count ninefold compared to the standard LLaVA architecture.
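The token-reduction arithmetic above can be sketched in a few lines. A 384×384 image with 14×14 patches gives a 27×27 grid of 729 patch embeddings; compressing each 3×3 neighborhood into one token yields 729 / 9 = 81 tokens, the ninefold reduction. The MLP sizes, grouping scheme, and ReLU activation below are illustrative assumptions, not OmniVLM's published projector details:

```python
import numpy as np

def project_image_tokens(patch_embeddings, w1, b1, w2, b2):
    """Compress a 27x27 grid of patch embeddings to 81 tokens via an MLP.

    Hypothetical sketch: groups each 3x3 patch neighborhood into one
    vector, then projects it into the language model's token space.
    """
    grid = patch_embeddings.reshape(27, 27, -1)
    d = grid.shape[-1]
    # (27, 27, d) -> (9, 9, 3, 3, d) -> (81, 9*d): one vector per 3x3 block.
    grouped = grid.reshape(9, 3, 9, 3, d).transpose(0, 2, 1, 3, 4).reshape(81, 9 * d)
    hidden = np.maximum(grouped @ w1 + b1, 0.0)  # ReLU (activation assumed)
    return hidden @ w2 + b2

d_vision, d_lm = 16, 32  # toy dimensions; real embedding sizes are far larger
rng = np.random.default_rng(0)
patches = rng.standard_normal((729, d_vision))
w1 = rng.standard_normal((9 * d_vision, 64)); b1 = np.zeros(64)
w2 = rng.standard_normal((64, d_lm)); b2 = np.zeros(d_lm)
tokens = project_image_tokens(patches, w1, b1, w2, b2)
print(tokens.shape)  # (81, 32): 9x fewer tokens than the 729 input patches
```

The key design point is that compression happens before the language model, so the decoder attends over 81 image tokens instead of 729, which is where the latency savings come from.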
Training
The training process for OmniVLM involves three stages:
- Pretraining: Establishes basic visual-linguistic alignments using image-caption pairs, focusing on learning projection layer parameters.
- Supervised Fine-tuning (SFT): Enhances contextual understanding with image-based QA datasets, training on structured image-enriched chat histories.
- Direct Preference Optimization (DPO): The model generates responses, a teacher model produces minimally edited corrections that target accuracy-critical elements, and the resulting chosen-rejected pairs are used for preference fine-tuning.
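The DPO stage above optimizes a standard preference objective over those chosen-rejected pairs: the negative log-sigmoid of the scaled margin between policy and reference log-probability ratios. A minimal sketch (the beta value and variable names are illustrative, not taken from the OmniVLM release):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair of sequence
    log-probabilities under the policy and a frozen reference model."""
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy favors the teacher-corrected (chosen) response more than
# the reference does, the loss is small; flipping the preference raises it.
low = dpo_loss(-5.0, -9.0, ref_chosen_lp=-6.0, ref_rejected_lp=-8.0)
high = dpo_loss(-9.0, -5.0, ref_chosen_lp=-8.0, ref_rejected_lp=-6.0)
print(low < high)  # True
```

Because the teacher's corrections are minimal edits, the chosen and rejected sequences differ mainly in the hallucination-prone spans, concentrating the preference signal where accuracy matters.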
Guide: Running Locally
To run OmniVLM on your device, follow these steps:
- Install Nexa-SDK: This framework supports local on-device inference for various models. Install it via the Python package or an executable installer.
- Run the Model: Execute the following command in your terminal:
nexa run omniVLM
For optimal performance, consider using cloud GPUs from providers like AWS or Google Cloud, which offer the necessary computational power for running complex models efficiently.
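The two steps above can be condensed into a short shell session. The pip package name below is an assumption based on Nexa-SDK's documentation; confirm it against their install guide before running:

```shell
# Install Nexa-SDK (package name assumed; an executable installer is
# also available per the steps above)
pip install nexaai

# Download and run the model locally
nexa run omniVLM
```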
License
OmniVLM is licensed under the Apache 2.0 License, allowing for broad usage and modification in compliance with the license terms.